Nota bene: This file accompanies the empirical study Housing Prices in Spain; note, however, that the table of contents of this file and of the paper do not follow the same order.
# Function used to format inline text.
applyFormat <- function(text) {
paste('<span style="font-family: Courier New, monospace;">', text, '</span>', sep = "")
}
Wikipedia has a list of 1313 Spanish cities by population, which will be used as a starting point to acquire information about real-estate prices in the country.
The main idea of this project is to understand the key factors that drive housing prices. The data about prices and attributes come from fotocasa. The data collection is mainly carried out by a Python web scraper.
A first query of fotocasa returned about 23,000 pages, so extracting all that information is not feasible. Instead, a subset of cities based on their population will be retrieved from the aforementioned list. Of the 1313 cities, 25% will be used to form the sample, which amounts to \(1313 \times 0.25 \approx 329\) cities; these need to be randomly selected if one wants to make general statements about the whole population without biasing the results.
As much of Spain's population is concentrated in a few large cities, it does not make sense to give all cities equal weights of being selected. If one believes that housing prices are the result of supply and demand, and that both are a function of the population, one can use the relative share of the population living in unit i (city i) as the probability of selection for that city.
# Import the csv file with information on the cities and create a new column with the relative population weights, to be used as weights in the sample function
cities <- as.data.frame(read_csv('./01_RawDatasets/01_SpainCities.csv')) %>%
mutate(RelPopulation = Population / sum(Population))
# Create a sample without replacement from the dataframe
# Set the random seed to ensure reproducibility: 23111992 (my birthday)
set.seed(23111992)
sample <- sample_n(cities, size = ceiling(nrow(cities) * 0.25),
replace = FALSE, weight = cities$RelPopulation)
# and save the data to use it in python
# write.csv(sample, '02_SampledCities.csv')
# print some values of the new sampled data
datatable(sample, options = list(pageLength = 10))
#view(sample)
Next, Python is used to retrieve data from the site mentioned at the beginning of this section.
In a first run, the Python module selenium was used to
query the cities on fotocasa.es mentioned in Section 0.1. For each query, the website returns
a list, spread over several pages, with houses and apartments for the selected
city. For each ad in that list, the program scraped its URL (see
Fig. 0.1), which leads the user to another page
where the attributes of the apartment are displayed in more
detail, and saved each of the URLs into a table called REF
in a local database called attributes.db. The process
lasted 7 days, from 08.05.2023 to 15.05.2023, and a total of
around 59,000 URLs were stored.
In a second run, using the requests and
BeautifulSoup Python modules, the attributes of the
housings were obtained. As the website recognized the web scraper as a
robot, rotating proxies had to be used, which slowed the process down
enormously; as a consequence, it was not possible to query all
59,000 URLs. To ensure randomization in the query process, at each query
the list of URLs was randomly shuffled and a URL selected. This
process lasted another 13 days.
Now, the attributes for the real estates saved in the database need to
be assigned a city, and they need to be matched to the initial
dataset called sample, as defined in Section 0.1. R has a package named
RSQLite that allows the user to handle databases, and it
will be used to retrieve the information from the locally stored
database attributes.db.
# instantiate the database connection
conn <- dbConnect(SQLite(), './01_RawDatasets/attributes.db')
# the Attributes are saved in a table called ATTR
# Create and send a query to obtain the attributes from the table ATTR
query <- 'SELECT attributes FROM ATTR'
queried <- dbGetQuery(conn, query)
# In order for the user to get an idea about how the data was stored, we will print the first row of the dataframe 'queried'
print(queried[1, 1])
## [1] "{\"1\": {\"url\": \"https://www.fotocasa.es/es/comprar/vivienda/sevilla-capital/calefaccion-parking-jardin-trastero-ascensor-piscina/165669218/d?from=list\", \"city\": \"sevilla-capital\", \"price\": \"99499\", \"discount\": 0, \"elementary\": [\"1 hab.\", \"1 baño\", \"40 m²\", \"3ª Planta\"], \"Tipo de inmueble\": \"Apartamento\", \"Planta\": \"3ª planta\", \"Parking\": \"Privado\", \"Ascensor\": \"Sí\", \"Consumo energía\": \"C999 kW h m² / año\", \"Emisiones\": \"G999 kg CO₂ m² / año\"}}"
The output above is a Python dictionary converted to JSON format. Its first key is a unique ID, i.e., a number, and the value related to that ID is again a dictionary, which will be referred to as the inner dictionary. For all datapoints, the first (key, value) pair of the inner dictionary is the URL that was scraped; the second pair is the city where the housing is located. Next, the reader will find a key named price and another named discount. The fifth key-value pair is "elementary": a list with the attributes found in Fig. 1.2 where the numbers 1, 2, 3, 4 are circled. These are common to all units. For these attributes there was no appropriate text to be scraped; the only information about the key was a picture displayed next to the value (such as a bed, a shower, and so on). Hence, the attribute values were saved to a list, and distinguishing between them will be part of the data processing. Further characteristics that were scraped are the ones assigned numbers 5-15 in Fig. 1.2. Notice that these characteristics are not uniform across apartments, so identifying the unique characteristics will also be part of the work at a later stage.
The extra features (no. 16 in Fig. 1.2) were not collected, for they are very random and not housing-specific characteristics; for example, the ones just mentioned translate to wardrobes, flooring, home appliances, and so on.
"""Data structure: Python Dictionary"""
{ID: {url: "some_url",
city: "some_city",
elementary: "[attr1, attr2, …]",
named_attribute1: value1,
named_attribute2: value2,
.
.
.
}
}
The JSON data stored in the dataframe queried needs to be
parsed into a format understood by R. This needs to be done
piecewise, otherwise R returns an error. The
data returned will be in a (named) matrix form, where the row name is
the unit ID as displayed in the Python dictionary above, the
column names are the keys of the inner dictionary, and the entries
are the values of the inner dictionary.
# Parse the JSON of the first unit in order to create the variable (dataframe)
# that will hold the remaining data.
data <- fromJSON(queried$attributes[1])
# Create a dataframe
data <- as.data.frame(do.call(rbind, data))
for (i in 2:nrow(queried)) {
# get each row of the data frame queried at a time and Parse it to R-Format
queried.df <- fromJSON(queried$attributes[i])
# Transform it to a dataframe
queried.df <- as.data.frame(do.call(rbind, queried.df[1]))
# append it to the first dataframe
data <- bind_rows(data, queried.df)
}
# disconnect from the database
dbDisconnect(conn)
datatable(data, options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
The output obtained is a dataframe with lists inside it. This is a consequence of how the data was saved in a Python dictionary.
Next, a list is created with only the non-numerical parts of the entries in the column elementary. That is, for the list ["1 hab.", "1 baño", "40 m²", "3ª Planta"], the code will return another list ["hab.", "baño", "m²", "ª Planta"]; this list will be used to distinguish between the unique components stored in that column.
list_elementary <- data[ , 'elementary']
# create a function to get the strings and apply it 'listwise'
text_extractor <- function(my_list){
text <- character()
for (single_list in my_list) {
# substitute the numbers by an empty string
string <- gsub('[0-9]', "", single_list)
# trim the returned string, for the case there is any empty space
string <- trimws(string)
# add the string to the vector 'text'
text <- c(text, string)
}
return(text)
}
text_vector <- unlist(lapply(list_elementary, text_extractor))
# identify the unique values:
print(unique(text_vector))
## [1] "hab." "baño" "m²"
## [4] "ª Planta" "habs." "baños"
## [7] "Bajos" "m² terreno" "Principal"
## [10] "Entresuelo" "Superior a Planta" "Sótano"
## [13] "Subsótano"
The output above translates as follows: hab. and habs. stand for habitación(es), i.e. rooms; baño and baños are bathrooms; m² is the surface of the housing; m² terreno is the surface of the land; and ª Planta, Bajos, Principal, Entresuelo, Superior a Planta, Sótano, and Subsótano all describe the floor on which the housing is located.
It follows from the output above that each element stored in the column 'elementary' belongs to one of five attributes: the location of the housing within the building (which will receive the name floor), the size of the housing (size), the size of the land (landsize), the number of rooms (rooms), or the number of bathrooms (bathrooms).
Now new columns need to be created in the dataframe in order to unlist the column elementary, and the values stored in that column need to be correctly assigned to the newly created columns.
keywords_floor <- c('Planta', 'Bajos', 'Principal', 'Entresuelo', 'Superior a Planta', 'Sótano', 'Subsótano')
keywords_rooms <- c('hab.', 'habs.')
keywords_bathrooms <- c('baño', 'baños')
# Create an empty dataframe with the size of the list, with entries as being NA and with columns [room, bathroom, size, landsize, floor]
mapping_df <- data.frame(matrix(NA, nrow = length(list_elementary), ncol = 5))
colnames(mapping_df) <- c('room', 'bathroom', 'size_sqm', 'landsize_sqm', 'floor')
# Iterate over the list_elementary
for (i in seq_along(list_elementary)) {
# and iterate over the elements of the list
for (j in seq_along(list_elementary[[i]])) {
# save the element in a variable for comparison
element <- list_elementary[[i]][j]
if (grepl('hab', element)) {
mapping_df$room[i] <- element
} else if (grepl('baño', element)) {
mapping_df$bathroom[i] <- element
} else if (grepl('m²', element) & !grepl('terreno', element)) {
mapping_df$size_sqm[i] <- element
} else if (grepl('m²', element) & grepl('terreno', element)) {
mapping_df$landsize_sqm[i] <- element
} else if (any(grepl(paste(keywords_floor, collapse = '|'),
element, ignore.case = TRUE))) {
mapping_df$floor[i] <- element
}
}
}
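As a standalone illustration of the matching rules above, here are the grepl() checks applied to the elementary list of the first scraped unit shown earlier in this section:

```r
# The grepl() checks from the loop above, applied to the example list
# from the first scraped unit.
element <- c("1 hab.", "1 baño", "40 m²", "3ª Planta")
grepl('hab', element)                              # TRUE FALSE FALSE FALSE
grepl('baño', element)                             # FALSE TRUE FALSE FALSE
grepl('m²', element) & !grepl('terreno', element)  # FALSE FALSE TRUE FALSE
```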
Next, the elementary column is dropped from the dataset, as it is no longer needed, and the lists stored inside the dataset will be unlisted. Notice how one can access the elements inside those lists:
data[1, 1][[1]] # this is how one obtains the element of the list
## [1] "https://www.fotocasa.es/es/comprar/vivienda/sevilla-capital/calefaccion-parking-jardin-trastero-ascensor-piscina/165669218/d?from=list"
# drop the elementary column
data <- data[ , !colnames(data) %in% 'elementary']
# Create a dataframe to store the unlisted values
unlisted_data <- data.frame(matrix(nrow = dim(data)[1], ncol = dim(data)[2]))
colnames(unlisted_data) <- colnames(data)
error <- list(NA, NA)
# Now, iterate over the columns of the data dataframe and save the unlisted values into this new 'unlisted_data' dataframe
for (i in 1:ncol(data)) {
for (j in 1:nrow(data)) {
tryCatch({
unlisted_data[j, i] = data[j, i][[1]][[1]]
}, error = function(e) {
})
}
}
# create an ID column for the dataframe
unlisted_data$ID <- seq(nrow(unlisted_data))
# create an ID column for the dataframe mapping_df
mapping_df$ID <- seq(nrow(mapping_df))
# We join both dataframes
data <- full_join(unlisted_data, mapping_df, by = 'ID')
Next, a copy of the dataset data, named
data_copy1, is created, in case any operation on
the original dataset returns an inappropriate value.
data_copy1 <- data
data <- data_copy1
# drop the url column, as it is no longer needed
data <- data[, -c(1)]
The dataset data and the dataset sample
need to be joined, so that the cities in the dataset data are
assigned their autonomous regions and provinces. The method reads:

1. Create a dataframe named cities to store the 'ID' and the 'city' column of the dataset data.
2. Create a dataframe named municipalities to store the columns 'ID' and 'Municipality' of the dataset sample.
3. Create, in both dataframes, a new column with the city names transformed to plain ASCII; in cities this new column is called ASCII_Cities and in municipalities it is called ASCII_Municipality.
4. Compute the string distance between the entries stored in the column ASCII_Cities and the entries stored in the column ASCII_Municipality in the municipalities dataframe.
5. Obtain the index of the cities with minimum distance by using the which.min function.
6. Store the matched ID in the cities dataframe.
7. Join cities and sample in a dataframe named joined_frames.
8. Finally, use the ID column to unite the datasets data and joined_frames.
# First, create a subset of dataset `data` with the name of the cities and their IDs
cities <- data[ , c('ID', 'city')]
# And a subset of the dataset `sample` with the name of the cities and their respective IDs
municipalities <- sample[ , c('ID', 'Municipality')]
# First of all, remove words like 'capital' and hyphens from this dataframe, as these are not found in the names of the cities stored in the dataset municipalities.
# Create a function to transform the words to Latin-Format
word_formatter <- function(text) {
# remove the accents
no_accent <- stri_trans_general(text, 'Latin-ASCII')
# Remove the word 'capital' (relevant for the cities-dataset)
no_capital <- gsub('capital', "", no_accent)
# remove any hyphen from the word (relevant for the cities-dataset)
no_hyphen <- gsub('-', " ", no_capital)
# remove any space between the words
no_space <- gsub("\\s+", "", no_hyphen)
# write the word in lower cases
lower_case <- tolower(no_space)
# Remove any ','
no_comma <- gsub(',', "", lower_case)
return(no_comma)
}
# save this transformation to a new column (ASCII_Municipality) in the dataset municipalities
municipalities$ASCII_Municipality <- sapply(municipalities$Municipality, word_formatter)
# apply the same function to the dataset cities
cities$ASCII_Cities <- sapply(cities$city, word_formatter)
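As a quick sanity check of word_formatter(), consider the city spelling seen in the raw attributes at the beginning of this section:

```r
# 'sevilla-capital' (as stored by the scraper): 'capital' is dropped,
# the hyphen becomes a space, spaces are removed, and the result is
# lower-cased.
word_formatter('sevilla-capital')  # "sevilla"
```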
The cities in both datasets are matched based on the smallest string-distance between the datapoints.
# We first create a new column in the cities dataframe to hold the matched IDs
cities$matched_ID <- NA
# We now iterate over each row in the dataframe cities and match the names
# in the column ASCII_Cities with the names in the column ASCII_Municipality in the dataframe municipalities and store the municipality-ID
for (i in 1:nrow(cities)) {
distances <- stringdist(cities$ASCII_Cities[i],
municipalities$ASCII_Municipality)
# obtain the minimum distance index
min_distance <- which.min(distances)
# query the index in the data set municipalities and obtain ID
matched_ID <- municipalities[min_distance, 'ID']
# Store the ID into the dataframe cities
cities$matched_ID[i] <- matched_ID
}
#view(cities)
Now use this new column matched_ID to join the cities and
the sample dataframe.
# copy the sample dataframe
sample_copy <- sample
colnames(sample_copy)[1] <- 'matched_ID'
joined_frames <- left_join(cities, sample_copy, by = 'matched_ID')
In the next snippet we return only non-repeated values and check that each city has been correctly assigned.
view_joined_frames_unique <- joined_frames[!duplicated(joined_frames$city), ]
#view(view_joined_frames_unique)
Notice that the columns [city, ASCII_Cities, matched_ID] of
joined_frames are no longer needed. Thus, before joining the
datasets, those columns will be dropped.
# Drop the columns mentioned from the dataframe joined_frames
to_be_dropped <- c('city', 'ASCII_Cities', 'matched_ID')
joined_frames <- joined_frames[ , -which(names(joined_frames) %in% to_be_dropped)]
# and join this dataframe to the dataframe 'data' based on the column ID
data <- left_join(data, joined_frames, by = 'ID')
#view(data)
length(unique(data$Municipality))
## [1] 172
nrow(data)
## [1] 13015
Notice that in the dataset that will be imported, cities like La
Codesera are stored as Codesera, La. This pattern is repeated
across all values in this dataset. First of all, the order of the words
will be reversed. This will help with the computation of the
min-string-distance between the cities in this dataset, which will be
called salary, and the dataset municipalities,
which, as the reader knows by now, stores the cities of the column named
Municipality of the dataset sample. The salary dataset
contains the mean salary per city, retrieved from the Instituto Nacional de Estadística (INE);
it was likewise obtained by web scraping in Python.
salary <- read_csv('./01_RawDatasets/03_Mean_Salary.csv')
## Rows: 7703 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Municipality, Provincia, Renta
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Let us first have a look at the salary dataset.
datatable(salary, options = list(pageLength = 10))
# create a function to rearrange the words: 'Albuera, La' will become 'La Albuera'
corrected_cities <- function(element) {
splits = strsplit(element, ", ")[[1]]
if (length(splits) == 2){
paste(splits[2], splits[1], sep = " ")
} else {
splits[1]
}
}
salary$Municipality <- sapply(salary$Municipality, corrected_cities)
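A quick check of the rearrangement defined above:

```r
# Names stored as 'Name, Article' are reversed; single names pass through.
corrected_cities("Albuera, La")  # "La Albuera"
corrected_cities("Madrid")       # "Madrid"
```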
One needs to be a little more careful with the dataset
salary, as it contains many more cities than the
sample or municipalities dataset does. Hence, the
iteration and comparison of the cities must occur over the dataset that
has the smallest number of cities: stringdist() will
always return a value, so if the largest dataset is iterated over, cities
that are not found in the smallest dataset will simply receive whichever
value in the smallest dataframe happens to be most similar.
The iteration therefore occurs over the dataset
municipalities, and the comparison is made
between city i in the dataset municipalities and all
the cities in the dataset salary. Thus, first, a unique
ID for each unique municipality in the dataset salary needs to be
created. These IDs will be the ones passed into a new
column of the dataset municipalities and used to join
both datasets.
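The pitfall just described can be made concrete with a minimal sketch: which.min() always returns some index, so a city without a true counterpart still gets matched to whatever candidate happens to be closest.

```r
library(stringdist)

# 'Cee' has no true counterpart among the candidates, yet which.min()
# still picks the most similar one, silently producing a wrong match.
candidates <- c('Madrid', 'Cea', 'Sevilla')
distances  <- stringdist('Cee', candidates, method = 'jw')
candidates[which.min(distances)]  # "Cea"
```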
salary$Municipality <- as.factor(salary$Municipality)
salary$ID <- as.numeric(salary$Municipality)
# exclude any duplicate from the dataset
salary <- salary[!duplicated(salary$ID, ), ]
# create a column which is the union between the city and the province
salary$MunProv <- paste(salary$Municipality, salary$Provincia, sep = " ")
municipalities <- sample[ , c(1, 2, 4)]
# do the same for the dataframe municipality
municipalities$MunProv <- paste(municipalities$Municipality, municipalities$Province, sep = " ")
Next, iterate over the dataset municipalities and compute
the min-string-distance between its city-province combinations and the
city-province combinations stored in the dataset salary.
municipalities$matched_ID <- NA
for (i in 1:nrow(municipalities)) {
distance = stringdist(stri_trans_general(municipalities$MunProv[i], 'Latin-ASCII'), stri_trans_general(salary$MunProv, 'Latin-ASCII'), method = 'jw')
min_distance = which.min(distance)
municipalities$matched_ID[i] <- salary$ID[min_distance]
}
# Change the column name in the salary dataframe from ID to matched_ID in order to join both dataframes
colnames(salary)[4] <- 'matched_ID'
joined_frames <- left_join(municipalities, salary, by = 'matched_ID')
# we now do a visual inspection to assure the correctness of the assignments
#view(joined_frames)
Here are the cities that were incorrectly assigned:
| ID | municipalities-dataframe | salary-dataframe | Correct salary-ID |
|---|---|---|---|
| 3 | Valencia | Palencia | 6897 |
| 11 | Alicante | Albatana | 137 |
| 413 | San Quirico de Tarrasa (San Quirze del Vallés) | Sant Quirze de Besora | 5982 |
| 51 | Parla | Malagón | 4970 |
| 15 | Gijón | San Martín de Oscos | 2959 |
| 184 | Rentería (Errenteria) | Urnieta | 2517 |
| 218 | Villajoyosa | Villatoya | 3660 |
| 972 | Cobeña | Cardeña | 2046 |
| 452 | La Eliana | Lantadilla | 3400 |
| 18 | La Coruña | La Carlota | 6 |
| 21 | Tarassa (Terrassa) | La Garriga | 6444 |
| 585 | Cambadons | Tapia de Casariego | 1465 |
| 292 | San Antonio Abad (Sant Antoni de Portmany) | San Antonio de Benagéber | 5902 |
| 158 | Santurce | Santomera | 6158 |
| 352 | Ribarroja del Turia | Riola | 5520 |
| 33 | San Sebastián (Donostia) | San Sebastián de los Reyes | 2285 |
| 314 | Puebla de Vallbona (La Pobla de Vallbona) | Puebla de Valles | 3562 |
| 264 | Alacuás (Alaquàs) | Arconada | 154 |
| 128 | San Vicente del Raspeig (Sant Vicent del Raspeig) | San Vicente del Valle | 5992 |
| 86 | San Baudilio de Llobregat (Sant Boi de LLobregat) | Sant Hipòlit de Voltregà | 5905 |
| 430 | Benicasim (Beincàssim) | Benimassot | 1060 |
| 608 | Montroig | Montferri | 4453 |
| 222 | Petrel (Petrer) | Pedreguer | 5102 |
| 467 | Llodio (Laudio) | Okondo (Artziniega) | 3754 |
| 29 | Pamplona (Iruña) | Lana | 4940 |
| 192 | Vendrell | Vinebre | 2449 |
| 736 | Alcácer(Alcàsser) | Ayuela | 255 |
| 967 | Cee | Cea | 1896 |
| 145 | Villarreal (Vila-real) | Ullastrell | 7095 |
| 496 | Fuenterrabía (Hondarribia) | Olaberria | 3150 |
| 82 | Torrente | Cofrentes | N.A. |
| 398 | Alfaz del Pi | Alar del Rey | 3390 |
| 17 | Vitoria | Ziordia | 7567 |
| 333 | San Juan de Alicante (Sant Joan d’Alacant) | Beniardà | 5936 |
| 101 | Avilés | Belmonte de Miranda | 789 |
| 970 | Puentedeume | Fuentes de Carbajal | 2827 |
| 884 | Puebla de Farnals | Puebla de Yeltes | 5329 |
| 410 | Puzol | Portilla | 5228 |
| 275 | Jávea | Redován | 5473 |
| 587 | Llanes | Tapia de Casariego | 6393 |
# Note that these are not the IDs that were incorrectly assigned from the dataset salary to the municipalities.
# They are the IDs that, in the dataset municipalities, identify the housings; for these IDs (or housings) the column 'matched_ID' received wrong assignments.
wrong_assigned_ids <- c(3, 11, 413, 51, 15, 184, 218, 972, 452, 18, 21, 585, 292, 158, 352, 33,
314, 264, 128, 86, 430, 608, 222, 467, 29, 192, 736, 967, 145,
496, 82, 398, 17, 333, 101, 970, 884, 410, 275, 587)
# Correct ids: these are the IDs that should be assigned from the salary dataset to the matched_ID-column in the dataset municipalities.
correct_ids <- c(6897, 137, 5982, 4970, 2959, 2517, 3660, 2046, 3400, 6, 6444, 1465, 5902, 6158,
5520, 2285, 3562, 154, 5992, 5905, 1060, 4453, 5102, 3754, 4940, 2449,
255, 1896, 7095, 3150, NA, 3390, 7567, 5936, 789, 2827, 5329, 5228,
5473, 6393)
# iterate over the wrong_assigned IDs and assign the correct ID's to those values:
j = 0
for (i in wrong_assigned_ids) {
j = j + 1
municipalities[municipalities$ID == i, 'matched_ID'] <- correct_ids[j]
}
# Now we join the salary and municipalities dataframes again based on their matched IDs:
joined_frames <- left_join(municipalities, salary, by = 'matched_ID')
#view(joined_frames)
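As a side note, the correction loop above can also be written without an explicit counter, by aligning the two vectors with match(); a sketch:

```r
# match() gives, for each municipality ID, its position in
# wrong_assigned_ids (or NA); the correct salary-ID is then assigned in
# one vectorized step, equivalent to the loop above.
idx <- match(municipalities$ID, wrong_assigned_ids)
fix <- !is.na(idx)
municipalities$matched_ID[fix] <- correct_ids[idx[fix]]
```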
By visual inspection, the user can be assured that the cities have now
been correctly matched. Next, the municipalities dataframe
and the salary one have the column matched_ID in
common, which will be used to append the column Renta from
the salary dataframe to municipalities.
Recall that the Municipality column in the
municipalities dataset and in the data dataset
are the same, and it is thus used to append municipalities (a subset of
it containing only the columns Renta and Municipality) to the
data dataset.
# add the renta-column from the dataset salary to the dataset municipalities based on the matched_ID
salary_subset <- salary[, c(3, 4)]
municipalities <- left_join(municipalities, salary_subset, by = 'matched_ID')
# obtain only the columns from municipality that are relevant to us
municipalities_subset <- select(municipalities, 'Municipality', 'Renta')
# Append this subset based on the names of the cities `Municipality` column to our main dataset `data`
data <- left_join(data, municipalities_subset, by = 'Municipality')
data <- as_tibble(data)
The information about the mean salary per city (renta media) has finally
been added to the main dataset, data. There are still other
datasets that need to be joined to data; this will
be done in the following sections.
The next dataset to be imported was retrieved (scraped) from 15MPedia.
Among other information, this dataset contains the
surface area of several cities in Spain. We would like to have this
information in order to compute the population density for the cities
that will be studied in a later section. The information imported from
04_Surface will be stored in a dataframe called
density_df.
Note that the dataset data contains more up-to-date information, as
its population figures refer to 2022. Given that the surface of a city is
unlikely to have changed in these two years (unlike the population), we can
compute the current density based on the older surface values.
density_df <- read.csv('./01_RawDatasets/04_SurfacePerCity.csv', sep = ';')
# Notice we only want the first (X), the second (Municipio) and the 7th column (Superficie..km..) of this dataframe
density_df <- select(density_df, X, Municipio, Superficie..km..)
# rename the columns in order to make it easier for us to access them:
colnames(density_df) <- c('ID', 'Municipality', 'Surface')
# have a look at the dataset
datatable(density_df, options = list(pageLength = 10))
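For later reference, once the Surface column has been joined to the population data, the density itself is a one-line computation; a minimal sketch with made-up numbers (column names assumed for illustration):

```r
# Hypothetical example: population density in inhabitants per km²,
# assuming Population and Surface (in km²) end up in the same dataframe.
df <- data.frame(Municipality = c('A', 'B'),
                 Population   = c(100000, 25000),
                 Surface      = c(50, 10))
df$Density <- df$Population / df$Surface
df$Density  # 2000 2500
```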
Just as before, the city names will be matched using the
min-string-distance approach. The dataset municipalities is
again created from the sample dataframe, as defined in Section 0.1. In the dataframe
municipalities, a new column named matched_ID is
created. Next, the code iterates over the cities and matches them to
the ones stored in the density_df dataframe. The
density_df was imported with an ID column in it, which will be
passed to the matched entries in the matched_ID column of the
municipalities dataframe, so that the dataframes can be
joined based on these IDs. At a later stage, the correctness of the
assignments needs to be checked visually.
# create the municipalities dataframe
municipalities <- select(sample, ID, Municipality)
# create an empty column to store the matched IDs
municipalities$matched_ID <- NA
for (i in 1:nrow(municipalities)) {
distances = stringdist(stri_trans_general(density_df$Municipality, 'Latin-ASCII'),
stri_trans_general(municipalities$Municipality[i], 'Latin-ASCII'),
method = 'osa')
min_distance <- which.min(distances)
municipalities$matched_ID[i] <- density_df[min_distance, 'ID']
}
# Bring the cities from the density_df dataset into municipalities; first, the 'ID' column of density_df needs to be renamed in order to unite both
colnames(density_df)[1] <- 'matched_ID'
# unite both datasets
joined_frames <- left_join(municipalities, density_df, by = 'matched_ID')
# We now use this joined_frames to visually inspect whether the cities have been correctly assigned.
#view(joined_frames)
Here are the cities wrongly assigned:
| ID | municipalities-dataframe | density_df-dataframe | correct Density-ID |
|---|---|---|---|
| 1158 | Puebla de la Calzada | Puebla de la Reina | NA |
| 16 | Hospitalet de LLobregat | El Prat de LLobregat | NA |
| 334 | Erandio | Grado | NA |
| 57 | Cádiz | Candín | NA |
| 413 | San Quirico de Tarrasa | San Tirso de Abres | NA |
| 112 | Villanueva y Geltrú (Vilanova i la Geltrú) | Villanueva del Rey | 4170 |
| 174 | Blanes | Llanes | NA |
| 184 | Rentería (Errenteria) | Méntrida | 4337 |
| 337 | Olesa de Montserrat | Ossa de Montiel | NA |
| 839 | San Fulgencio (Sant Fulgencio) | San Asension | NA |
| 710 | El Grove | El Gordo | NA |
| 972 | Cobeña | Cobeta | NA |
| 665 | Echévarri (Etxebarri) | Estépar | NA |
| 89 | Ceuta | Cella | NA |
| 8 | Palma (Palma de Mallorca) | Cala | 380 |
| 452 | La Eliana (L’Eliana) | La Olivaa | NA |
| 591 | Santa Cruz de Bezana | Santa Cruz de Mudela | NA |
| 124 | Granollers | Brañosera | NA |
| 309 | Molins de Rey | Olías del Rey | NA |
| 292 | San Antonio Abad (Sant Antoni de Portmany) | Santa Combra | 909 |
| 516 | Masamagrell (Massamagrell) | Benamaurel | NA |
| 370 | San Juan de Aznalfarache | San Juan de la Nava | NA |
| 850 | El Sauzal | El Saucejo | NA |
| 21 | Tarrasa (Terrasa) | Tàrrega | 1989 |
| 158 | Santurce (Santurtzi) | Santa Fe | 3772 |
| 537 | Benetúser (Benetússer) | Bonete | NA |
| 1182 | Anglés (Anglès) | Angüés | NA |
| 189 | Burjasot (Burjassot) | Barjas | NA |
| 97 | Guecho | Nueno | NA |
| 563 | Ogíjares | Mijares | NA |
| 314 | Puebla de Vallbona (La Pobla de Vallbona) | Puebla de Valles | 4244 |
| 264 | Alacuás (Alaquàs) | Alcaraz | NA |
| 351 | Zarauz (Zarautz) | Zarazoga | NA |
| 555 | Canet de Mar | Lloret de Mar | NA |
| 377 | Valle de Egüés (Egües, Eguesibar) | Valle de Mena | 2748 |
| 38 | Castellón de la Plana (Castelló de la Plana) | Castrejón de la Peña | 1091 |
| 263 | Durango | Aranga | NA |
| 41 | San Cristóbal de La Laguna (La Laguna) | San Cristóbal de Entreviñas | 1243 |
| 387 | Arroyo de la Encomienda | Baños de la Encina | NA |
| 81 | Melilla | Sevilla | NA |
| 86 | San Baudilio de Llobregat (Sant Boi de LLobregat). | San Martín de la Vega | NA |
| 554 | Palau-solità i Plegamans | Peralta de Calansaz | NA |
| 814 | Abanto y Ciérvana (Abano Zierbena) | La Cierva | NA |
| 304 | Paiporta | Pájara | NA |
| 461 | El Astillero (Real Astillero) | El Sotillo | NA |
| 467 | Llodio (Laudio) | LLíria | 3834 |
| 167 | Alcantarilla | Alcántara | NA |
| 111 | Torremolinos | Montemolín | NA |
| 736 | Alcácer (Alcàsser) | Alcocer | NA |
| 967 | Cee | Cea | 2531 |
| 426 | La Zubia | La Fueva | NA |
| 851 | Catral | Carral | NA |
| 154 | Mairena del Aljarafe | Mairena del Alcor | NA |
| 776 | San Antonio de Benagéber (Sant Antoni de Benaixeve) | Sant Antoni de Portmany | NA |
| 1095 | La Puebla de Alfindén | La Puebla de Valverde | NA |
| 214 | Azuqueca de Henares | Yunquera de Henares | NA |
| 496 | Fuenterrabía (Hondarribia) | Fuenferrada | 4748 |
| 425 | Archena | Marchena | NA |
| 231 | Barberá del Vallés | La Roca del Vallès | NA |
| 47 | Lérida (LLeida) | Mérida | 367 |
| 360 | Vilaseca (Vila-seca) | Valseca | NA |
| 22 | Badalona | Baraona | NA |
| 398 | Alfaz del Pi (L’Alfàs del Pi) | Alcalá del Río | NA |
| 998 | Moncófar (Moncofa) | Monóvar | NA |
| 155 | Figueras (Figueres) | Viguera | NA |
| 17 | Vitoria (Vitoria-Gasteiz) | Villoria | 219 |
| 333 | San Juan de Alicante (Sant Joan d’Alacant) | San Juan de Plan | NA |
| 510 | La Algaba | La haba | NA |
| 346 | Ibi | Ibias | 2304 |
| 114 | Castelldefels | Castelldans | NA |
| 422 | Humanes de Madrid | Linares de Mora | NA |
| 88 | Fuengirola | Quiroga | NA |
| 999 | Andorra | Añora | 759 |
| 1032 | Canet de Berenguer (Canet d’en Berenguer) | Canal de Berdún | NA |
| 867 | Reinosa | Reina | NA |
| 242 | San Pedro de Ribas (Sant Pere de Ribes) | San Pedro de Rozados | 3545 |
| 146 | Mollet del Vallés (Mollet del Vallès) | Cortes de Pallás | NA |
| 237 | Lejona (Leioa) | Lena | NA |
| 486 | Castilleja de la Cuesta | Castillejo de Iniesta | NA |
| 884 | Puebla de Farnals (La Pobla de Farnals) | Puebla de Valles | NA |
| 410 | Puzol (Puçol) | Ourol | NA |
| 624 | Vilanova del Camí | Vilanova de Sau | NA |
| 953 | Viladecaballs (Viladecavalls) | Valdecaballeros | NA |
| 62 | Gerona | Gerana | NA |
| 76 | Cornellá de Llobregat | El Prat de Llobregat | NA |
| 375 | Moncada | Monda | NA |
| 69 | Las Rozas de Madrid | Las Rozas de Valdearroyol | 2485 |
Differently from the results obtained in Section 1.2, many of the cities in the dataset
municipalities could not be found in the dataset
density_df. To solve this problem, the following is
done:

1. Correct, by visual inspection, the IDs of the cities that do have a counterpart in density_df.
2. Retrieve the surface of the remaining cities from Wikipedia and append it to density_df (more about this later in Section 1.3.2).
wrongly_assigned_ids <- c(112, 184, 8, 292, 21, 158, 314, 377, 38, 41, 467,
967, 496, 47, 17, 346, 999, 242, 69)
correct_ids <- c(4170, 4337, 380, 909, 1989, 3772, 4244, 2748, 1091, 1243, 3834,
2531, 4748, 367, 219, 2304, 759, 3545, 2485)
# Now we iterate over the wrongly assigned IDs and assign the correct ID to the column `matched_ID` in the municipalities dataframe
j = 0
for (i in wrongly_assigned_ids) {
j = 1 + j
municipalities[municipalities$ID == i, 'matched_ID'] = correct_ids[j]
}
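As an aside, the same correction can be written without the explicit counter by using base R's match(). This is an equivalent alternative, shown here on toy stand-in data rather than the project's frames:

```r
# Toy stand-ins (not the project's frames) to illustrate the idea
toy_mun   <- data.frame(ID = c(8, 17, 112), matched_ID = NA_real_)
wrong_ids <- c(112, 17)
right_ids <- c(4170, 219)

# match() gives the row positions of the wrong IDs in one step,
# so the correction becomes a single vectorized assignment
idx <- match(wrong_ids, toy_mun$ID)
toy_mun$matched_ID[idx] <- right_ids
toy_mun$matched_ID  # NA 219 4170
```

This relies on the IDs being unique, which holds for the municipality IDs used here.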
Wikipedia normally has detailed information on cities. Notice from Figure 1.1 that the URLs that lead to those pages are identical except for the city name at the very end. This feature will be used to retrieve the information (the surface of each city) that could not be found in the dataset 04_SurfacePerCity.csv (or density_df). The cities with missing surface are stored in the file 04_NASurface.csv.
na_cities <- read_csv('./01_RawDatasets/04_NASurface.csv')
## Rows: 68 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): NA_municipality
## dbl (1): ID
## lgl (1): status
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
datatable(na_cities, options = list(pageLength = 10))
The names in parentheses in the table above are not needed: they are alternative spellings of the city names (Catalan, Basque, Valencian, and so on). Hence, in a first step, the names in parentheses are deleted; next, another column is created in which multi-word city names are joined by '_' (underscore).
na_cities <- mutate(na_cities,
single_names = gsub(" \\(.*\\)", '', NA_municipality),
united_names = gsub(' ', '_', single_names))
Only two columns of this dataframe are needed for the webscraping: ID, so the cities can be matched at a later stage, and united_names.
# keep only the cited columns
na_cities <- select(na_cities, ID, united_names)
Now the process of webscraping can start:
base_url <- "https://es.wikipedia.org/wiki/Puebla_de_la_Calzada"
#base_url <- "https://es.wikipedia.org/wiki/San_Fulgencio"
content <- read_html(base_url)
# get table with information about the surface
table <- html_nodes(content, ".infobox.geography.vcard")
table_contents <- as.data.frame(html_table(table, fill = TRUE))
#view(table_contents)
datatable(table_contents, options = list(pageLength = 10))
If one inspects the table above, it can be seen that one of the rows in
the first column contains the word superficie. This is exactly
the information needed. Thus, this process will be repeated for each
city in the dataframe na_cities: if the word superficie
is found in the first column, the other columns will be checked for the
superficie value (given in km²), and these values will be stored in a new
column of the dataframe na_cities named
superficie.
na_cities$superficie <- NA
for (i in na_cities$united_names) {
tryCatch({
base_url <- paste0('https://es.wikipedia.org/wiki/', i)
# query url and search for the table containing the information about the surface
content <- read_html(base_url)
table <- html_nodes(content, ".infobox.geography.vcard")
table_contents <- as.data.frame(html_table(table, fill = TRUE))
# We now use a regular expression to check whether the word 'superficie' appears in the first column of the dataframe
result <- sum(grepl('superficie', table_contents[ , 1], ignore.case = TRUE))
index <- which.max(grepl('superficie', table_contents[ , 1], ignore.case = TRUE))
# If the regular expression matches any row, the sum will be one or more. If it is more than one, we cannot know for sure which row holds the information we are looking for. If it is exactly one, a unique row contains the superficie value. If it is zero, the request did not return usable content.
if (result == 1) {
# We now iterate over the columns of the dataframe to make sure we obtain as value a word containing 'km' in it
for (j in 1:ncol(table_contents)){
if (grepl('km', table_contents[index, j], ignore.case = TRUE) == TRUE) {
na_cities[na_cities$united_names == i, 'superficie'] <- table_contents[index, j]
break
}
}
}
}, error = function(e) {
print('Url-not uniquely identifiable')
})
}
## [1] "Url-not uniquely identifiable"
## [1] "Url-not uniquely identifiable"
## [1] "Url-not uniquely identifiable"
datatable(na_cities, options = list(pageLength = 10))
Here are the cities for which a uniquely identifiable URL could not be found:
na_cities[is.na(na_cities$superficie),]
## # A tibble: 6 × 3
## ID united_names superficie
## <dbl> <chr> <chr>
## 1 839 San_Fulgencio <NA>
## 2 972 Cobeña <NA>
## 3 89 Ceuta <NA>
## 4 263 Durango <NA>
## 5 81 Melilla <NA>
## 6 167 Alcantarilla <NA>
The missing information is now searched manually for these 6 cities:
| City | Superficie |
|---|---|
| San Fulgencio | 19,75 km² |
| Cobeña | 20,8 km² |
| Ceuta | 18,5 km² |
| Durango | 10,79 km² |
| Melilla | 12,3 km² |
| Alcantarilla | 16,24 km² |
ids <- c(839, 972, 89, 263, 81, 167)
superficies <- c('19,75 km²', '20,8 km²', '18,5 km²', '10,79 km²', '12,3 km²', '16,24 km²')
# We now iterate over the ids and assign the surface to those cities
j = 0
for (i in ids) {
j = j + 1
na_cities[na_cities$ID == i, 'superficie'] = superficies[j]
}
datatable(na_cities, options = list(pageLength = 10))
Now that all the required information is available, the datasets
municipalities and density_df still need to be
joined. Recall that a correction has been made for some wrongly assigned
values in the column matched_ID. The surface values of the
cities missing from density_df have not been corrected yet,
but they will be in a moment. First, join the datasets (including the incorrectly assigned values):
joined_frames <- left_join(municipalities, density_df, by = 'matched_ID')
# Now, we need to correct the surface for those cities that are stored in the `na_cities` dataframe. That dataset and the municipality one share the same ID column. We will use this to correct the values in the Surface column of the joined dataframe
for (i in na_cities$ID){
joined_frames[joined_frames$ID == i, 'Surface'] = na_cities[na_cities$ID == i, 'superficie']
}
# Note that we only need to keep the columns 'Surface' and 'Municipality.x' from the joined frame
joined_frames <- select(joined_frames, Surface, Municipality.x)
colnames(joined_frames)[2] <- 'Municipality'
The name of the municipalities in the dataframe
joined_frames are exactly the same names stored in the
data-dataset. So the column Municipality will be
used to join both dataframes.
data <- left_join(data, joined_frames, by = 'Municipality')
# create a copy of the data, in case an operation returns inappropriate values
data_copy2 <- data
# this is the second time we make a (backup) copy
data <- data_copy2
Finally the surface per city has been added to the main dataset
data.
There are still two further datasets that shall be part of the analysis: one about criminality and another about the unemployment rate, both at the province level.
The unemployment dataset will be matched to the
sample dataframe using the minimum string distance. For
this, a new dataset called provinces will be created, which
stores the same provinces as those found in the sample
dataframe. The unemployment-rate dataset will first be stored in a
dataframe called unemployment_df. The reader can get a
glimpse of the dataset by inspecting the output below:
# Import the dataset about the total rate of unemployment per province
unemployment_df <- read.csv('./01_RawDatasets/05_TasaParo.csv', sep = ';', encoding = 'latin1')
datatable(unemployment_df, options = list(pageLength = 10))
Next, some formatting will be applied to the dataset above, such as removing the numbers from the names and making double-named provinces more comparable, in order to increase accuracy when computing the minimum string distance.
# Exclude the numbers from the names
unemployment_df$Provincias <- gsub("\\d+ ","", unemployment_df$Provincias)
# Note that some entries, like Alicante, carry two names (Alicante/Alacant). We only want the first one
unemployment_df$Provincias <- gsub('/.*', "", unemployment_df$Provincias)
# Create an ID column in the unemployment dataset
unemployment_df$ID <- as.numeric(as.factor(unemployment_df$Provincias))
# change the province name Araba (as stored in the unemployment dataset) to Álava, and Lleida to Lérida; these are alternative names for the same provinces. This will help with the matching, as these are the names stored in the sample/provinces dataframe
unemployment_df[unemployment_df$Provincias == 'Araba', 'Provincias'] = 'Álava'
unemployment_df[unemployment_df$Provincias == 'Lleida', 'Provincias'] = 'Lérida'
# Create a dataset with the provinces from the sample dataset
provinces <- as.data.frame(unique(sample$Province))
colnames(provinces) <- 'provinces'
# Create an ID column to store the matched ids in the dataset provinces
provinces$ID <- NA
# We now iterate over the dataset provinces and compute the min-distance among all the provinces in the dataset unemployment_df
for (i in 1:nrow(provinces)) {
distances <- stringdist(unemployment_df$Provincias, provinces$provinces[i], method = 'lcs')
min_distance <- which.min(distances)
provinces$ID[i] <- unemployment_df$ID[min_distance]
}
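To make the matching step more transparent, here is a minimal base-R sketch of the distance that stringdist(..., method = 'lcs') computes, assuming the usual definition nchar(a) + nchar(b) − 2·LCS(a, b); the package's C implementation is the authoritative one:

```r
# Length of the longest common subsequence via dynamic programming
lcs_length <- function(a, b) {
  x <- strsplit(a, '')[[1]]
  y <- strsplit(b, '')[[1]]
  m <- matrix(0L, length(x) + 1, length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      m[i + 1, j + 1] <- if (x[i] == y[j]) m[i, j] + 1L
                         else max(m[i, j + 1], m[i + 1, j])
    }
  }
  m[length(x) + 1, length(y) + 1]
}
lcs_dist <- function(a, b) nchar(a) + nchar(b) - 2L * lcs_length(a, b)

# Identical names have distance 0; the closest candidate wins in which.min()
lcs_dist('Albacete', 'Albacete')  # 0
lcs_dist('Albacete', 'Alicante')  # 6
```

Since identical strings have distance 0, exact matches always win, and near-identical spellings beat unrelated names.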
# unite both dataframes now based on the ID column
joined_frames <- left_join(provinces, unemployment_df, by = 'ID')
#view(joined_frames)
After checking the correctness of the assignments, the ID column is
used to obtain the Total (unemployment rate) column. Next, this column
is added to the provinces dataset.
unemployment_df <- select(unemployment_df, ID, Total)
provinces <- left_join(provinces, unemployment_df, by = 'ID')
colnames(provinces)[c(1, 3)] <- c('Province', 'UnempRate')
provinces <- provinces[, -2]
Notice that the provinces dataframe and the
data dataframe share the same name for the provinces, which
allows joining both datasets based on the province-names (recall that by
now, the dataset provinces already contains the information
about the unemployment rate.)
data <- left_join(data, provinces, by = 'Province')
The output below shows how the dataset 06_Criminality is formatted. It
will be stored in a variable (dataframe) called
criminality_df.
# First import the dataset
criminality_df <- read.csv('./01_RawDataSets/06_Criminality.csv', skip = 5, encoding = 'latin1')
datatable(criminality_df, options = list(pageLength = 10))
Notice that the dataframe is organized with the observation names and the criminality values in alternating rows. The second column will therefore be shifted by -1 so that the name of each observation and its criminality value end up in the same row; then the rows containing NA values (or empty strings) in the second column are dropped.
criminality_df %>%
mutate(
shifted_column = c(Enero.diciembre.2022[-1], NA)
) %>% # shift the second column by -1 unit and store it in a column named shifted_column
select(-Enero.diciembre.2022, -X.1) %>% # drop the columns we do not need anymore
filter(grepl("^\\d", shifted_column)) -> criminality_df # keep only rows whose shifted value starts with a digit
# Now change the name of the columns
colnames(criminality_df) <- c('province', 'total_criminality')
# Drop the rows with national total and the foreign total
criminality_df <- filter(criminality_df, !(province == 'Total Nacional' | province == 'En el extranjero'))
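To see why the shift aligns labels and values, consider a toy version of the alternating layout (hypothetical data, mirroring the structure of the raw export):

```r
# Label rows and value rows alternate in the raw export;
# shifting the value column up by one aligns each label with its value
raw <- data.frame(label = c('Madrid', '', 'Sevilla', ''),
                  value = c('', '123', '', '45'),
                  stringsAsFactors = FALSE)
raw$shifted <- c(raw$value[-1], NA)
# keep only the rows where the shifted value starts with a digit
aligned <- raw[grepl('^\\d', raw$shifted), c('label', 'shifted')]
aligned  # Madrid 123 / Sevilla 45
```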
Next, a little webscraping is done in order to obtain the population of the provinces. The information is stored in the criminality dataset; it will be needed in the data-analysis section. The table can be found on Wikipedia and is illustrated in Fig. 1.2. Click to be directed to the table.
base_url <- "https://es.wikipedia.org/wiki/Anexo:Provincias_y_ciudades_autónomas_de_España"
content <- read_html(base_url)
# get the table with information about the provinces' population
table <- html_nodes(content, 'table')[[1]]
as.data.frame(html_table(table, fill = TRUE)) %>%
select(Puesto, Nombre, Población) %>%
mutate(
prov_pop = gsub("\\..*", "", Población),
prov_pop = gsub("&.*?0", "", prov_pop)
) %>%
filter(!Puesto == 'Total') %>% # lets drop the row with the total
select(-Población) -> provpop_df # and drop the column Población, as we have created the cleaned column prov_pop
Now join the datasets provpop_df and
criminality_df. As the only column they have in common is
the province names, one needs to match them once again using the
stringdist()
function.
# We create an ID column in the dataframe criminality_df to store the matched IDs of the provinces
criminality_df$ID <- NA
# now, for each city in criminality_df, we apply the stringdist function
for (i in 1:nrow(criminality_df)) {
distances <- stringdist(provpop_df$Nombre, criminality_df$province[i], method = 'lcs')
min_distance <- which.min(distances)
criminality_df$ID[i] <- provpop_df[min_distance, 'Puesto']
}
# we now join both dataframes
colnames(provpop_df)[1] <- 'ID'
joined_frames <- left_join(criminality_df, provpop_df, by = 'ID')
# Inspect the dataframe for correctness
#view(joined_frames)
# Assignment was correct for all cities
criminality_df <- left_join(criminality_df, provpop_df, by = 'ID')
criminality_df <- select(criminality_df, -Nombre) # select all the columns but the Nombre, otherwise we will have the name of the provinces twice.
# create column with criminality per 100,000
criminality_df <- mutate(criminality_df,
crime_rate =((as.numeric(total_criminality) / as.numeric(prov_pop))) * 1e5)
The assignment was correct for all cities. Next, the dataset
criminality_df will be appended to
provinces, which contains the unique provinces as given in
the dataset data. Again, the datasets are joined using the
minimum-string-distance method.
provinces <- as.data.frame(unique(data$Province))
colnames(provinces) <- 'provinces'
# Create column to store id
provinces$ID <- NA
Let’s first inspect the province names in both dataframes:
unique(sort(criminality_df$province))
## [1] "Albacete" "Alicante/Alacant" "Almería"
## [4] "Araba/Álava" "Asturias" "Ávila"
## [7] "Badajoz" "Balears (Illes)" "Barcelona"
## [10] "Bizkaia" "Burgos" "Cáceres"
## [13] "Cádiz" "Cantabria" "Castellón/Castelló"
## [16] "Ceuta" "Ciudad Real" "Córdoba"
## [19] "Coruña (A)" "Cuenca" "Gipuzkoa"
## [22] "Girona" "Granada" "Guadalajara"
## [25] "Huelva" "Huesca" "Jaén"
## [28] "León" "Lleida" "Lugo"
## [31] "Madrid" "Málaga" "Melilla"
## [34] "Murcia" "Navarra" "Ourense"
## [37] "Palencia" "Palmas (Las)" "Pontevedra"
## [40] "Rioja (La)" "Salamanca" "Santa Cruz de Tenerife"
## [43] "Segovia" "Sevilla" "Soria"
## [46] "Tarragona" "Teruel" "Toledo"
## [49] "Valencia/València" "Valladolid" "Zamora"
## [52] "Zaragoza"
unique(sort(provinces$provinces))
## [1] "Álava" "Albacete" "Alicante"
## [4] "Almería" "Badajoz" "Barcelona"
## [7] "Burgos" "Cáceres" "Cádiz"
## [10] "Cantabria" "Castellón" "Ciudad Real"
## [13] "Córdoba" "Cuenca" "Granada"
## [16] "Guadalajara" "Huelva" "Huesca"
## [19] "Jaén" "Las Palmas" "León"
## [22] "Lugo" "Madrid" "Málaga"
## [25] "Melilla" "Navarra" "Palencia"
## [28] "Pontevedra" "Principado de Asturias" "Salamanca"
## [31] "Santa Cruz de Tenerife" "Segovia" "Sevilla"
## [34] "Tarragona" "Teruel" "Toledo"
## [37] "Valencia" "Valladolid" "Zamora"
## [40] "Zaragoza"
Notice that in the second dataset each province has a single name,
unlike the first one, where some provinces carry a double name. Those
names are made comparable by renaming them in the dataset criminality_df.
to_be_renamed = c('Araba/Álava', 'Alicante/Alacant', 'Castellón/Castelló', 'Valencia/València', 'Palmas (Las)')
substitute_by = c('Álava', 'Alicante', 'Castellón', 'Valencia', 'Las Palmas')
for (i in 1:length(to_be_renamed)) {
criminality_df[criminality_df$province == to_be_renamed[i], 'province'] = substitute_by[i]
}
Now compute the min-string-distance:
for (i in provinces$provinces){
distances <- stringdist(criminality_df$province, i, method = 'lcs')
min_distance <- which.min(distances)
provinces[provinces$provinces == i, 'ID'] = criminality_df[min_distance, 'ID']
}
# now join the dataframes to inspect for correctness of the assignment
joined_frames <- left_join(provinces, criminality_df, by = 'ID')
#view(joined_frames)
All values have been correctly assigned.
provinces <- left_join(provinces, criminality_df, by = 'ID')
provinces <- select(provinces, -province)
# Now we join the provinces dataset to the data dataset
colnames(provinces)[1] <- 'Province'
data <- left_join(data, provinces, by = 'Province')
A backup copy of the dataframe is created as a precaution.
# this is the third backup copy
data_copy3 <- data
data <- data_copy3
Finally the dataset is complete! In the next step the focus will be on
cleaning the dataset data and doing some exploratory
analysis.
In this section the datapoints stored in the dataset
data will be cleaned and transformed such that they can be
used for numerical analysis. This includes dropping columns that will
not be used anymore, creating new variables, and creating new factors.
datatable(data, options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
These are the columns that will be used:
c('price', 'Tipo de inmueble', 'floor', 'Parking', 'Ascensor', 'Consumo energía', 'Emisiones', 'Antigüedad', 'Estado', 'room', 'bathroom', 'size_sqm', 'Population', 'Municipality', 'Province', 'Region', 'Renta', 'Surface', 'UnempRate', 'crime_rate')
Their names will be changed to:
c('price', 'type', 'floor', 'parking', 'lift', 'energy_con', 'pollution', 'age', 'condition', 'room', 'bathroom', 'size_sqm', 'pop', 'municipality', 'province', 'region', 'mean_salary', 'surface', 'unemp', 'crime_province')
# 1. First select only those columns that will be used in the analysis.
columns_to_be_used <- c('price', 'Tipo de inmueble', 'floor', 'Parking', 'Ascensor', 'Consumo energía', 'Emisiones', 'Antigüedad', 'Estado', 'room', 'bathroom', 'size_sqm', 'Population', 'Municipality', 'Province', 'Region', 'Renta', 'Surface', 'UnempRate', 'crime_rate')
new_names <- c('price', 'type', 'floor', 'parking', 'lift', 'energy_con', 'pollution', 'age', 'condition', 'room', 'bathroom', 'size_sqm', 'pop', 'municipality', 'province', 'region', 'mean_salary', 'surface', 'unemp', 'crime_province')
# we create another dataframe to work on the changed version of the dataframe data.
datav1 <- data[,columns_to_be_used]
# next, change the name of the columns.
colnames(datav1) <- new_names
Now notice that for the columns pollution and energy the values are classified from A to G, where A is the best and G the worst classification. Along with these classifications, there is also the estimated amount of energy consumption and pollution per year. They will be transformed into a single numerical value as follows:
\[ s_{i}=\frac{i_{energy}}{i_{energy}+j_{pollution}}\cdot\left(kWh/m^{2}/año\right)+\frac{j_{pollution}}{i_{energy}+j_{pollution}}\cdot\left(kg\,CO_{2}/m^{2}/año\right) \]
The lower this score is, the more energy efficient is the object.
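A worked example of the score (hypothetical certificate values, not from the dataset): an object rated E for energy consumption with 150 kWh/m²/año, and C for emissions with 30 kg CO₂/m²/año.

```r
# Letter ranks: A = 1 (best) ... G = 7 (worst)
letter_rank <- c(A = 1, B = 2, C = 3, D = 4, E = 5, F = 6, G = 7)
i_energy    <- letter_rank[['E']]                  # 5
j_pollution <- letter_rank[['C']]                  # 3
w_energy    <- i_energy / (i_energy + j_pollution) # 5/8
w_pollution <- 1 - w_energy                        # 3/8
score <- w_energy * 150 + w_pollution * 30         # 93.75 + 11.25
score  # 105
```

Note that the worse (higher-ranked) certificate receives the larger weight, so a poor energy letter pulls the score towards the consumption figure.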
energy_letters <- list('A' = 1, 'B' = 2, 'C' = 3, 'D' = 4, 'E' = 5, 'F' = 6, 'G' = 7)
# Let us first count the datapoints that contain no digits
no_dig_energy <- sum(!grepl("\\d", datav1$energy_con))
no_dig_pollution <- sum(!grepl("\\d", datav1$pollution))
paste('There are [energy consumption; pollution]=[', no_dig_energy, ";", no_dig_pollution, "] datapoints that have no digits in it.", sep = "")
## [1] "There are [energy consumption; pollution]=[165;165] datapoints that have no digits in it."
Datapoints that have no values about energy consumption (energy and pollution) will be dropped from the dataset. As can be seen from the output above, this is the case for 165 objects.
datav1 <- datav1[grepl("\\d", datav1$pollution), ]
# Now we do the following:
# Create a dataset only with the columns for pollution and energy consumption:
# first create an ID column in the dataset datav1
datav1$ID <- seq(nrow(datav1))
df_energy <- select(datav1, energy_con, pollution, ID)
# create another column only with the letter certification and compute weights
df_energy$weights_energy <- NA
df_energy$weights_pollution <- NA
first_letters_energy <- substr(df_energy$energy_con, 1, 1)
first_letters_pollution <- substr(df_energy$pollution, 1, 1)
for (i in 1:length(first_letters_energy)) {
letter_energy <- first_letters_energy[i]
letter_pollution <- first_letters_pollution[i]
number_energy <- energy_letters[[letter_energy]]
number_pollution <- energy_letters[[letter_pollution]]
# compute the weights for energy consumption
weight_energy <- (number_energy/(number_pollution + number_energy))
weights_pollution <- 1 - weight_energy
df_energy$weights_energy[i] <- weight_energy
df_energy$weights_pollution[i] <- weights_pollution
}
Next, two columns are created in the above dataframe that store only the numerical values of the columns ‘energy_con’ and ‘pollution’.
df_energy %>%
mutate(kw_year = as.numeric(gsub("\\D", "", energy_con)),
co_year = as.numeric(gsub("\\D", "", pollution)),
energy_score = weights_energy * kw_year + weights_pollution * co_year) %>%
select(ID, energy_score) -> df_energy
Now a distribution of the scores is created. As 7 buckets are to be created, their upper bounds are computed as the (100/7)·bucket_number percentiles of the distribution, where the first percentile equals \(\frac{100}{7}\approx 14.285714\).
# get unique scores:
quantile_i <- 1/7
probs <- c(quantile_i * 1, quantile_i * 2, quantile_i * 3, quantile_i * 4, quantile_i * 5, quantile_i * 6)
classifier <- quantile(unique(df_energy$energy_score), probs = probs)
Our energy score reads:
| Bucket | Interpretation | Numerical Value |
|---|---|---|
| (0; 58.877] | very good | 0 |
| (58.877; 98.341] | good | 1 |
| (98.341; 139.922] | normal | 2 |
| (139.922; 179.444] | acceptable | 3 |
| (179.444; 301.143] | bad | 4 |
| (301.143; 1137.928] | very bad | 5 |
| (1137.928; \(\infty\)) | horrible | 6 |
names(classifier) <- c(0, 1, 2, 3, 4, 5)
df_energy$energy_class <- NA
# now we iterate over the dataset and assign the classifier to those values
for (i in 1:nrow(df_energy)){
logical_list <- sapply(classifier, function(x) x <= df_energy$energy_score[i])
no_true <- sum(logical_list)
if (no_true < 6) {
# get the index name with the first false value, this is the energy classification
index_false <- names(which(!logical_list)[1])
df_energy$energy_class[i] <- as.numeric(index_false)
} else {
df_energy$energy_class[i] <- 6
}
}
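The bucket assignment above can equally be expressed with base R's findInterval(), which counts how many upper bounds lie at or below each score — an equivalent alternative to the loop, illustrated with hypothetical scores:

```r
# The six upper bounds from the bucket table above
bounds <- c(58.877, 98.341, 139.922, 179.444, 301.143, 1137.928)
scores <- c(40, 120, 500, 2000)
# 0 bounds passed -> class 0 (very good), ..., all 6 passed -> class 6 (horrible)
findInterval(scores, bounds)  # 0 2 5 6
```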
df_energy <- select(df_energy, ID, energy_class)
Now the datasets are joined by their ID columns. Here’s the output of the operations above:
datav1 <- left_join(datav1, df_energy, by = 'ID')
datatable(datav1, options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
The columns pollution and energy_con are not needed anymore, so they will be dropped from the dataset. Next, the columns ‘room’, ‘bathroom’, ‘unemp’, ‘surface’, ‘size_sqm’, and ‘mean_salary’ will be cleaned so that they contain only numerical values. Further, the datatype of the column price needs to be converted from character to numeric.
# drop columns
datav1 <- select(datav1,-c(pollution, energy_con))
# substitute the commas in the unemp column to points and transform column to numerical values
datav1$unemp <- as.numeric(gsub(",", ".", datav1$unemp))
# we now clean the surface column.
# first remove all non-numerical values
datav1$surface <- gsub("[^0-9,.]", "", datav1$surface)
# now remove points that are being used as thousand-separators
datav1$surface <- gsub("\\.", "", datav1$surface)
# now substitute the commas by points and transform the values to numeric
datav1$surface <- as.numeric(gsub('\\,',".", datav1$surface))
# we now clean the column mean_salary
# remove the point working as thousand separator and transform data to numeric
datav1$mean_salary <- as.numeric(gsub('\\.', "", datav1$mean_salary))
# transform price column to be a numerical value
datav1$price <- as.numeric(datav1$price)
# transform column size_sqm to be numerical
datav1$size_sqm <- as.numeric(gsub('[^0-9,.]', "", datav1$size_sqm))
# transform column bathroom to be numerical
datav1$bathroom <- as.numeric(gsub('[^0-9]',"", datav1$bathroom))
# transform room column to be numerical
datav1$room <- as.numeric(gsub('[^0-9]',"", datav1$room))
datatable(datav1, options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
Notice that the information about the surface was only obtained so that the population density could be computed, which is done as the next step:
datav1 <- mutate(datav1,
density_skm = pop/surface)
In this section, the goal is to understand the dataset at hand; for this, descriptive statistics and plots will be used.
Note that the column lift contains only Sí or NA values,
meaning that the question of whether the property has a lift was either
answered with yes or no information about it was given at all. In this
case, given the counts of Sí and NA, the NAs
will be replaced by no.
non_na_data <- table(as.factor(datav1$lift), exclude = NULL)
# non_na_data already contains both counts (Sí and NA), so it can be plotted directly
barplot(non_na_data,
names.arg = c("Sí", "NA"),
xlab = 'category',
ylab = 'count'
)
So the numbers of Sí and NA are:
non_na_data
##
## Sí <NA>
## 7916 4934
Now, replace the values:
datav1[is.na(datav1$lift), 'lift'] = 'no'
# and now let's change 'Sí' to 'yes'
datav1[datav1$lift == 'Sí', 'lift'] = 'yes'
Next, substitute all the NA values in the dataset by the value \(-9\).
for (column in colnames(datav1)) {
if (sum(is.na(datav1[[column]])) > 0) {
if (is.character(datav1[[column]])) {
datav1[is.na(datav1[[column]]), column] = '-9'
} else {
datav1[is.na(datav1[[column]]), column] = -9
}
}
}
datatable(datav1, options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
If the reader takes a closer look at the column floor, he or she will find some redundancy. Next, this problem is tackled in order to reduce the number of factor levels in that column.
unique(datav1$floor)
## [1] "3ª Planta" "1ª Planta" "-9"
## [4] "2ª Planta" "4ª Planta" "7ª Planta"
## [7] "5ª Planta" "Bajos" "6ª Planta"
## [10] "9ª Planta" "8ª Planta" "14ª Planta"
## [13] "Principal" "Entresuelo" "11ª Planta"
## [16] "Superior a Planta 15" "10ª Planta" "Sótano"
## [19] "12ª Planta" "15ª Planta" "Subsótano"
## [22] "13ª Planta"
Notice that Principal and Bajos mean the same thing: the ground floor. One is normally used for houses and the other for buildings; hence, Bajos will be substituted by Principal.
datav1[grepl('Bajos', datav1$floor), 'floor'] = 'Principal'
Note that:
Subsótano (sub-basement) < Sótano (basement) < Entresuelo (between basement and ground floor) < Principal < 1ª Planta < 2ª Planta < … < 15ª Planta < Superior a Planta 15 (above the 15th floor). It follows that this variable has an ordinal structure and will thus be encoded as an ordered factor.
vector1 <- c(1)
for (i in 1:15){
vector1 <- append(vector1, paste0(i, 'ª Planta'))
}
vector1 <-vector1[-1]
vector0 <- c('-9', 'Subsótano', 'Sótano', 'Entresuelo', 'Principal')
vector2 <- c('Superior a Planta 15')
# join the three vectors
ordered_floors <- append(vector0, vector1)
ordered_floors <- append(ordered_floors, vector2)
ordered_floors
## [1] "-9" "Subsótano" "Sótano"
## [4] "Entresuelo" "Principal" "1ª Planta"
## [7] "2ª Planta" "3ª Planta" "4ª Planta"
## [10] "5ª Planta" "6ª Planta" "7ª Planta"
## [13] "8ª Planta" "9ª Planta" "10ª Planta"
## [16] "11ª Planta" "12ª Planta" "13ª Planta"
## [19] "14ª Planta" "15ª Planta" "Superior a Planta 15"
Now transform the column floor into an ordered factor:
datav1$floor <- factor(datav1$floor, ordered = TRUE, levels = ordered_floors)
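As a quick illustration of what the ordered encoding provides (toy values, not the project's data): ordered factors support direct comparisons against a level.

```r
f <- factor(c('Sótano', 'Principal', '3ª Planta'),
            ordered = TRUE,
            levels = c('Sótano', 'Principal', '1ª Planta', '2ª Planta', '3ª Planta'))
f > 'Principal'  # FALSE FALSE TRUE: only the third entry lies above the ground floor
as.integer(f)    # 1 2 5: the underlying ordinal codes
```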
Notice that the age column follows more or less the same syntax:
unique(datav1$age)
## [1] "-9" "50 a 70 años" "20 a 30 años" "30 a 50 años"
## [5] "Menos de 1 año" "10 a 20 años" "70 a 100 años" "5 a 10 años"
## [9] "1 a 5 años" "+ 100 años"
Ordering the values yields:
ordered_ages <- c('-9', "Menos de 1 año", '1 a 5 años', "5 a 10 años", "10 a 20 años",
"20 a 30 años", "30 a 50 años", "50 a 70 años",
"70 a 100 años", "+ 100 años")
datav1$age <- factor(datav1$age, ordered = TRUE, levels = ordered_ages)
datav1$parking <- factor(datav1$parking)
datav1$region <- factor(datav1$region)
datav1$type <- factor(datav1$type)
Now this dataframe is saved as the one used for the analysis from here
on. It is called dataset:
dataset <- datav1
#write.csv(dataset, file = './01_RawDatasets/00_CleanedSet.csv')
dataset <- read.csv('./01_RawDatasets/00_CleanedSet.csv')
Next, the distribution of prices is plotted in order to understand the data.
#png('./00_Pictures/02_Processed/06_pricedist.png')
options(scipen = 2, digits = 6)
dens <- density(dataset$price)
plot(dens, col = 'darkblue', ylab = 'Density', xlab = 'Prices in €', main = 'Price-Distribution')
polygon(dens, col = 'lightblue', border = NA)
grid()
#dev.off()
Notice that the distribution of prices resembles a \(\chi^{2}\) distribution much more than a normal one; this is because housing prices are always positive and outliers are not really that rare. Some descriptive statistics allow us to obtain a clearer picture of this distribution.
# The distribution
summary(dataset$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10000 149900 239000 360380 399000 12500000
The cheapest objects in the dataset are a house and an apartment, located in Villanueva de Córdoba and Barcelona, respectively:
# look at the min values:
dataset[dataset$price == 10000, ]
## X price type floor parking lift age condition room bathroom
## 6543 6543 10000 Casa o chalet -9 -9 no -9 -9 1 -9
## 12602 12602 10000 Piso -9 Privado no -9 -9 1 1
## size_sqm pop municipality province region mean_salary
## 6543 104 8587 Villanueva de Córdoba Córdoba Andalucía 9167
## 12602 10 1636193 Barcelona Barcelona Cataluña 16750
## surface unemp crime_province ID energy_class density_skm
## 6543 429.52 17.77 3671.47 6543 5 19.9921
## 12602 99.11 10.15 6412.01 12602 5 16508.9000
While the most expensive object is a house located in Barcelona:
dataset[dataset$price == 12500000,]
## X price type floor parking lift age condition
## 2089 2089 12500000 Casa o chalet -9 Privado no 30 a 50 años Casi nuevo
## room bathroom size_sqm pop municipality province region mean_salary
## 2089 10 11 2781 1636193 Barcelona Barcelona Cataluña 16750
## surface unemp crime_province ID energy_class density_skm
## 2089 99.11 10.15 6412.01 2089 1 16508.9
Next, the standard deviation and skewness of this distribution are computed:
std <- sqrt(var(dataset$price))
skewness <- (sum((dataset$price - mean(dataset$price))^3) / nrow(dataset)) / (std^3)
paste('skewness: ', skewness, '. Standard Dev.:', std)
## [1] "skewness: 7.07977677236022 . Standard Dev.: 433827.506257726"
A plot of the factor variables can also help identify properties of the data.
factor_columns <- c('type', 'floor', 'lift', 'age', 'condition')
par(mfrow = c(2, 1), cex.axis = 0.5, cex.lab = 0.5, cex.main = 0.8)
for (i in factor_columns) {
counts <- table(dataset[i])
barplot(counts, xlab = 'Factors', ylab = 'Frequency', main = paste('Variable: ', i, '-Distribution', sep = ""))
grid()
barplot(counts, names.arg = names(counts), add = TRUE)
}
One sees from the plot above that missing values play an important role for several variables.
In this section the object of study is the distribution of prices across autonomous communities and provinces, in order to better understand the spatial differences within the country. The shapefiles used in this section come from the website arcgis.
(sf_aa_cc <- st_read('./01_SF_Autonomous_Regions/Comunidades_Autonomas_ETRS89_30N.shp',
layer = 'Comunidades_Autonomas_ETRS89_30N'))
## Reading layer `Comunidades_Autonomas_ETRS89_30N' from data source
## `/Volumes/ExternalNew/Documents/01_Studies/01_University/01_Master/S.S.2023/03_Topics_in_Empirical_Research/03_Project/00_Project_LFE_8105069/01_SF_Autonomous_Regions/Comunidades_Autonomas_ETRS89_30N.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 19 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -1004500 ymin: 3132140 xmax: 1126930 ymax: 4859240
## Projected CRS: ETRS89 / UTM zone 30N
## Simple feature collection with 19 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -1004500 ymin: 3132140 xmax: 1126930 ymax: 4859240
## Projected CRS: ETRS89 / UTM zone 30N
## First 10 features:
## Codigo Texto Texto_Alt
## 1 01 Andalucía Andalucía
## 2 02 Aragón Aragón
## 3 03 Principado de Asturias Asturias
## 4 04 Islas Baleares Illes Balears
## 5 05 Canarias Canarias
## 6 06 Cantabria Cantabria
## 7 07 Castilla y León Castilla y León
## 8 08 Castilla - La Mancha Castilla - La Mancha
## 9 09 Cataluña Catalunya
## 10 10 Comunidad Valenciana Comunitat Valenciana
## geometry
## 1 MULTIPOLYGON (((280487 3993...
## 2 MULTIPOLYGON (((683851 4754...
## 3 MULTIPOLYGON (((271019 4838...
## 4 MULTIPOLYGON (((885505 4299...
## 5 MULTIPOLYGON (((-978944 317...
## 6 MULTIPOLYGON (((478590 4790...
## 7 MULTIPOLYGON (((505342 4715...
## 8 MULTIPOLYGON (((468371 4498...
## 9 MULTIPOLYGON (((825345 4515...
## 10 MULTIPOLYGON (((720291 4227...
Next, the dataset will be grouped by region using the function group_by() and the results are summarized using their means.
unique(dataset$region)
## [1] "Andalucía" "Cataluña" "Comunidad de Madrid"
## [4] "País Vasco" "Galicia" "Comunidad Valenciana"
## [7] "Aragón" "Castilla y León" "Canarias"
## [10] "Castilla-La Mancha" "Cana" "Cantabria"
## [13] "Extremadura" "Melilla" "Navarra"
## [16] "Principado de Asturias"
There is a non-existent region named Cana, which is assumed to be an error; its name should be ‘Canarias’. This can be verified by checking the cities in that region:
head(select(dataset[dataset$region == 'Cana', ], municipality), 5)
## municipality
## 476 Las Palmas de Gran Canaria
## 501 Santa Cruz de Tenerife
## 674 Las Palmas de Gran Canaria
## 945 Santa Cruz de Tenerife
## 1001 Las Palmas de Gran Canaria
It is clear that the cities in the output above are in the Canary Islands, so the region name will be corrected.
dataset[dataset$region == 'Cana', 'region'] = 'Canarias'
dataset[c('price', 'region')] %>%
group_by(region) %>%
summarize(lower_quarter = quantile(price, probs = c(0.25)),
median = quantile(price, probs = c(0.5)),
upper_quarter = quantile(price, probs = c(0.75)),
mean_price = mean(price),
n = n()) -> quant_prices_region
groupped_by_region <- group_by(dataset, region)
In order to display this information graphically, one has to merge the dataset imported from the shapefile with quant_prices_region. Here, again, the min-string-distance method is used. A closer look at the sf_aa_cc dataframe shows a column named Codigo (code), which uniquely identifies the regions and will be used to join the dataframes.
# First, create a dataframe with the column codigo and names of the provinces as given in the dataset sf_aa_cc
df_sf <- as.data.frame(sf_aa_cc)
# Create a column in the quant_prices_region to store the (matched IDs)
quant_prices_region$Codigo <- NA
# Lets first take a look at the provinces name in the two datasets.
unique(sort(quant_prices_region$region))
## [1] "Andalucía" "Aragón" "Canarias"
## [4] "Cantabria" "Castilla y León" "Castilla-La Mancha"
## [7] "Cataluña" "Comunidad de Madrid" "Comunidad Valenciana"
## [10] "Extremadura" "Galicia" "Melilla"
## [13] "Navarra" "País Vasco" "Principado de Asturias"
unique(sort(df_sf$Texto))
## [1] "Andalucía" "Aragón"
## [3] "Canarias" "Cantabria"
## [5] "Castilla - La Mancha" "Castilla y León"
## [7] "Cataluña" "Ceuta"
## [9] "Comunidad de Madrid" "Comunidad Foral de Navarra"
## [11] "Comunidad Valenciana" "Extremadura"
## [13] "Galicia" "Islas Baleares"
## [15] "La Rioja" "Melilla"
## [17] "País Vasco" "Principado de Asturias"
## [19] "Región de Murcia"
The dataset quant_prices_region has fewer regions than the one with all the regions in Spain, but at least the names of the existing regions are practically the same (except Comunidad Foral de Navarra). Thus, the min-string-distance approach is likely to work very well.
quant_prices_region[quant_prices_region$region == 'Navarra', 'region'] = 'Comunidad Foral de Navarra'
# Iterate over the dataset quant_prices_region, compute the string min-distance and assign the 'codigo' to it
for (i in 1:nrow(quant_prices_region)) {
distances <- stringdist(quant_prices_region$region[i], df_sf$Texto, method = 'lcs') # method: longest common subsequence
min_distance <- which.min(distances)
quant_prices_region$Codigo[i] = df_sf$Codigo[min_distance]
}
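The loop above can also be written without explicit iteration: stringdist::amatch() returns, for each element of its first argument, the index of the closest entry in the lookup table (a sketch; maxDist = Inf is needed so that even distant names are matched):

```r
# Sketch: vectorized closest-match lookup instead of the explicit loop
idx <- stringdist::amatch(quant_prices_region$region, df_sf$Texto,
                          method = 'lcs', maxDist = Inf)
quant_prices_region$Codigo <- df_sf$Codigo[idx]
```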
# We join the frames and inspect the result for correctness
joined_frames <- left_join(quant_prices_region, df_sf, by = 'Codigo')
#view(joined_frames)
The regions have been correctly assigned. Next, the price information needs to be brought into the shapefile, so the dataframes are joined using left_join(), where the left dataframe is the df_sf frame.
mean_prices_to_be_joined <- select(quant_prices_region, Codigo, mean_price, median, n)
df_sf <- left_join(df_sf, quant_prices_region, by = 'Codigo')
Next, the shapefile is merged with the summarized prices (regions without observations keep NA values), and centroids are computed to place the labels:
merged_data <- merge(sf_aa_cc, mean_prices_to_be_joined, by.x = 'Codigo', by.y = 'Codigo', all.x = TRUE)
merged_data$center <- st_centroid(merged_data$geometry)
merged_data <- mutate(merged_data,
cod_text = paste(Codigo, Texto, sep = " "))
Some region names will now be shortened, so that they can be used as labels in the plots.
| Old Name | New Name |
|---|---|
| Principado de Asturias | Asturias |
| Castilla y León | Cas. y León |
| Comunidad de Madrid | Madrid |
| Comunidad Foral de Navarra | Navarra |
| Islas Baleares | I. Baleares |
| Castilla - La Mancha | La Mancha |
| Comunidad Valenciana | Valencia |
| Región de Murcia | Murcia |
old_names <- c('Principado de Asturias', 'Castilla y León', 'Comunidad de Madrid',
'Comunidad Foral de Navarra', 'Islas Baleares', 'Castilla - La Mancha',
'Comunidad Valenciana', 'Región de Murcia')
new_names <- c('Asturias', 'Cas. y León', 'Madrid', 'Navarra', 'I. Baleares', 'La Mancha',
'Valencia', 'Murcia')
for (i in 1:length(old_names)) {
merged_data$cod_text <- gsub(old_names[i], new_names[i], merged_data$cod_text)
}
The graphic below shows the mean real-estate prices by autonomous community in Spain:
map_plot <- ggplot() +
geom_sf(data = merged_data, aes(fill = mean_price)) +
scale_fill_gradient2(low = "lightblue",
mid = 'white',
high = "red",
midpoint = median(merged_data$mean_price, na.rm = TRUE),
label = comma_format()) +
geom_text(data = merged_data,
aes(x = st_coordinates(center)[, 1], y = st_coordinates(center)[, 2], label = Codigo),
color = "black",
size = 3,
nudge_x = 0.1,
nudge_y = 0.1) +
labs(fill = "Mean Price", title = 'Mean Housing Prices Per Autonomous Region') +
theme_minimal() +
theme(plot.margin = margin(0, 0, 0, 0, "cm"),
axis.title.x = element_blank(),
axis.title.y = element_blank())
# Create a separate label plot with region names
label_plot <- ggplot() +
geom_text(data = merged_data, aes(x = 1, y = Codigo, label = cod_text),
color = "black",
size = 2,
hjust = 0,
vjust = 0.5) +
labs(x = "", y = "") +
theme_minimal() +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
# Combine the map plot and label plot using the patchwork package
combined_plot <- map_plot + label_plot +
plot_layout(ncol = 2, widths = c(11.5, 2.9)) +
theme(legend.text = element_text(size = 105))
# Display the combined plot
combined_plot
One should notice that the dataset has a severe problem with outliers:
plot_ar <- ggplot(groupped_by_region, aes(x = region, y = price)) +
geom_boxplot() +
labs(title = "Boxplot: Prices per Autonomous Region", x = "Region", y = "Prices") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 35, hjust = 1)) +
scale_y_continuous(labels = comma_format())
plot_ar
such that it makes sense to plot the prices per region based on the median instead of the mean.
map_plot <- ggplot() +
geom_sf(data = merged_data, aes(fill = median)) +
scale_fill_gradient2(low = "lightblue",
mid = 'white',
high = "red",
midpoint = median(merged_data$median, na.rm = TRUE),
label = comma_format()) +
geom_text(data = merged_data,
aes(x = st_coordinates(center)[, 1], y = st_coordinates(center)[, 2], label = Codigo),
color = "black",
size = 3,
nudge_x = 0.1,
nudge_y = 0.1) +
labs(fill = "Median Price", title = 'Median Housing Prices Per Autonomous Region') +
theme_minimal() +
theme(plot.margin = margin(0, 0, 0, 0, "cm"),
axis.title.x = element_blank(),
axis.title.y = element_blank())
# Create a separate label plot with region names
label_plot <- ggplot() +
geom_text(data = merged_data, aes(x = 1, y = Codigo, label = cod_text),
color = "black",
size = 2,
hjust = 0,
vjust = 0.5) +
labs(x = "", y = "") +
theme_minimal() +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
# Combine the map plot and label plot using the patchwork package
combined_plot <- map_plot + label_plot +
plot_layout(ncol = 2, widths = c(11.5, 2.9)) +
theme(legend.text = element_text(size = 105))
#ggsave('./00_Pictures/02_Processed/10_MedianPrice_Region.png', plot = map_plot)
# Display the combined plot
combined_plot
Ranking the autonomous regions according to their medians and means yields
merged_data <- merged_data %>% mutate(Rank_Mean = rank(-mean_price),
Rank_Median = rank(-median),
)
display_merged <- select(merged_data, !c('geometry'))
datatable(display_merged, options = list(pageLength = 10))
Extra: Produce table to be exported to latex:
latex_table <- as.data.frame(display_merged)
latex_table <- select(latex_table, Codigo, Texto, median, Rank_Median, n)
column_names <- c('Code', 'Regions', 'Median Price in €', 'Rank-Median', 'No. Obs.')
colnames(latex_table) <- column_names
latex_table <- arrange(latex_table, `Rank-Median`)
latex_table[is.na(latex_table$`Median Price in €`), 'Rank-Median'] = 'NA'
write_latex_table <- kable(latex_table, format = "latex", booktabs = TRUE)
writeLines(write_latex_table, 'table.tex')
Nevertheless, one must be careful about generalizing the price distribution, since some regions are much better represented than others. The following graphs show for which regions this is the case:
price_dist_plot <- ggplot(groupped_by_region, aes(x = region, y = price)) +
geom_boxplot() +
labs(title = "Boxplot: Prices per Autonomous Region", x = "Region", y = "Prices (scale = log10)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 35, hjust = 1)) +
scale_y_continuous(labels = comma_format()) + scale_y_log10()
## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.
# ----------------------------------
number_units <- summarize(groupped_by_region, numbers = n())
n_per_region <-ggplot(number_units, aes(x = region, y = numbers, fill = region)) +
geom_bar(stat = 'identity') +
labs(x = 'Region', y = 'No. of Datapoints', title = 'Number of Obs. Per Autonomous Region') +
theme_minimal() +
scale_y_continuous(limits = c(0, NA)) +
theme(legend.text = element_text(size =5),
legend.key.size = unit(0.2, 'cm'),
axis.text.x = element_blank(),
legend.position = 'none') +
guides(fill = guide_legend(override.aes = list(size = 0.01, ncol = 3)))
combined_plots <- grid.arrange(price_dist_plot, n_per_region, nrow = 2, heights = c(1.5, 1))
#ggsave(plot = combined_plots, "./00_Pictures/02_Processed/11_Boxplot_Region.png",
#height = 5, width = 10)
combined_plots
## TableGrob (2 x 1) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]
The same analysis as above will be done now at a province level:
groupped_by_province <- group_by(dataset, province)
groupped_by_province %>%
summarize(
lower_quarter = quantile(price, probs = 0.25),
median = quantile(price, probs = 0.5),
upper_quarter = quantile(price, probs = 0.75),
mean_price = mean(price)
) %>%
select(lower_quarter, median, upper_quarter, mean_price, province) -> quant_prices_province
quant_prices_province
## # A tibble: 40 × 5
## lower_quarter median upper_quarter mean_price province
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 162500 230000 297500 230000 Albacete
## 2 110000 165000 297500 240856. Alicante
## 3 63915 111000 165000 141029. Almería
## 4 74000 85000 98500 96488. Badajoz
## 5 185000 279250 463750 421767. Barcelona
## 6 86925 148500 214500 150483. Burgos
## 7 157250 159500 279500 226250 Cantabria
## 8 115000 148000 219900 167140. Castellón
## 9 34775 61000 90000 114278. Ciudad Real
## 10 140000 140000 140000 140000 Cuenca
## # ℹ 30 more rows
Next we import the shapefile containing the map of Spain divided in provinces.
sf_provinces <- st_read('./02_SF_Provinces/Provincias_ETRS89_30N.shp')
## Reading layer `Provincias_ETRS89_30N' from data source
## `/Volumes/ExternalNew/Documents/01_Studies/01_University/01_Master/S.S.2023/03_Topics_in_Empirical_Research/03_Project/00_Project_LFE_8105069/02_SF_Provinces/Provincias_ETRS89_30N.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 52 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -1004500 ymin: 3132140 xmax: 1126930 ymax: 4859240
## Projected CRS: ETRS89 / UTM zone 30N
Next, a column named Codigo is added to the quant_prices_province dataset and used to store the matched codes obtained with the stringdist() function. We first visually inspect the unique province names in both datasets. Names that are very different but refer to the same province will be changed a priori, in order to increase the accuracy of the min-string-distance method.
unique(sort(quant_prices_province$province))
## [1] "Álava" "Albacete" "Alicante"
## [4] "Almería" "Badajoz" "Barcelona"
## [7] "Burgos" "Cáceres" "Cádiz"
## [10] "Cantabria" "Castellón" "Ciudad Real"
## [13] "Córdoba" "Cuenca" "Granada"
## [16] "Guadalajara" "Huelva" "Huesca"
## [19] "Jaén" "Las Palmas" "León"
## [22] "Lugo" "Madrid" "Málaga"
## [25] "Melilla" "Navarra" "Palencia"
## [28] "Pontevedra" "Principado de Asturias" "Salamanca"
## [31] "Santa Cruz de Tenerife" "Segovia" "Sevilla"
## [34] "Tarragona" "Teruel" "Toledo"
## [37] "Valencia" "Valladolid" "Zamora"
## [40] "Zaragoza"
Next the names in the shapefile:
unique(sort(sf_provinces$Texto))
## [1] "Álava" "Albacete" "Alicante"
## [4] "Almería" "Asturias" "Ávila"
## [7] "Badajoz" "Barcelona" "Burgos"
## [10] "Cáceres" "Cádiz" "Cantabria"
## [13] "Castellón" "Ceuta" "Ciudad Real"
## [16] "Córdoba" "Cuenca" "Gerona"
## [19] "Granada" "Guadalajara" "Guipúzcoa"
## [22] "Huelva" "Huesca" "Islas Baleares"
## [25] "Jaén" "La Coruña" "La Rioja"
## [28] "Las Palmas" "León" "Lleida"
## [31] "Lugo" "Madrid" "Málaga"
## [34] "Melilla" "Murcia" "Navarra"
## [37] "Orense" "Palencia" "Pontevedra"
## [40] "Salamanca" "Santa Cruz de Tenerife" "Segovia"
## [43] "Sevilla" "Soria" "Tarragona"
## [46] "Teruel" "Toledo" "Valencia"
## [49] "Valladolid" "Vizcaya" "Zamora"
## [52] "Zaragoza"
The name Principado de Asturias will be changed to Asturias in order to increase the accuracy of the min-string-distance method.
quant_prices_province[quant_prices_province$province == 'Principado de Asturias', 'province'] = 'Asturias'
# Create a column to store the ids
quant_prices_province$Codigo <- NA
# Now we compute the string distance by iterating over our dataset
for (i in 1:nrow(quant_prices_province)) {
quant_prices_province$Codigo[i] = sf_provinces$Codigo[which.min(stringdist(sf_provinces$Texto, quant_prices_province$province[i]))]
}
# Let us join the frames and check for correctness
joined_frames <- left_join(quant_prices_province, sf_provinces, by = 'Codigo')
#view(joined_frames)
Once the correctness of the assignments has been visually inspected, one can merge both datasets and use the result to produce the map.
merged_data <- merge(sf_provinces, quant_prices_province, by.x = 'Codigo', by.y = 'Codigo', all.x = TRUE)
# We shorten the name of santa cruz de tenerife to S.C. Tenerife
merged_data[merged_data$Texto == 'Santa Cruz de Tenerife', 'Texto'] = 'S.C. Tenerife'
# We create a column with the names of the provinces plus their code, in order for us to use them as labels
merged_data <- mutate(merged_data, cod_text = paste(Codigo, Texto, sep = " "))
as_tibble(merged_data)
## # A tibble: 52 × 12
## Codigo Texto Texto_Alt Cod_CCAA CCAA lower_quarter median upper_quarter
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 01 Álava Araba 16 País… 192750 254500 335000
## 2 02 Albacete Albacete 08 Cast… 162500 230000 297500
## 3 03 Alicante Alacant 10 Comu… 110000 165000 297500
## 4 04 Almería Almería 01 Anda… 63915 111000 165000
## 5 05 Ávila Ávila 07 Cast… NA NA NA
## 6 06 Badajoz Badajoz 11 Extr… 74000 85000 98500
## 7 07 Islas Bal… Illes Ba… 04 Ille… NA NA NA
## 8 08 Barcelona Barcelona 09 Cata… 185000 279250 463750
## 9 09 Burgos Burgos 07 Cast… 86925 148500 214500
## 10 10 Cáceres Cáceres 11 Extr… 457000 457000 457000
## # ℹ 42 more rows
## # ℹ 4 more variables: mean_price <dbl>, province <chr>,
## # geometry <MULTIPOLYGON [m]>, cod_text <chr>
Next the map will be created.
# Get max and min coordinates
max_x <- c()
max_y <- c()
min_x <- c()
min_y <- c()
# This will be used to set the limits of the map in order to focus on the mainland
for (i in 1:length(merged_data$geometry)) {
matrix <- merged_data$geometry[[i]][[1]][[1]]
max_x <- append(max_x, max(matrix[, c(1)]))
max_y <- append(max_y, max(matrix[, c(2)]))
min_x <- append(min_x, min(matrix[, c(1)]))
min_y <- append(min_y, min(matrix[, c(2)]))
}
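As an aside, the extent of the layer can also be obtained directly with sf::st_bbox(), which avoids the manual loop; note that the loop above inspects only the first ring of each multipolygon, so the two results can differ slightly (a sketch):

```r
# Sketch: layer extent via st_bbox() instead of looping over geometries
bb <- sf::st_bbox(merged_data)
mainland_limits_bb <- c(bb[['xmin']], bb[['xmax']], bb[['ymin']], bb[['ymax']])
```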
merged_data$center <- sf::st_centroid(merged_data$geometry)
#mainland_limits <- c(min(min_x) /5 , max(max_x), min(min_y)*1.15, max(max_y))
mainland_limits <- c(min(min_x) , max(max_x), min(min_y), max(max_y))
map_plot <- ggplot() +
geom_sf(data = merged_data, aes(fill = mean_price)) +
scale_fill_gradient2(low = "lightblue",
mid = 'white',
high = "red",
midpoint = median(merged_data$median, na.rm = TRUE),
label = comma_format()) +
geom_text(data = merged_data,
aes(x = st_coordinates(center)[, 1], y = st_coordinates(center)[, 2], label = Codigo),
color = "black",
size = 2,
nudge_x = 0.1,
nudge_y = 0.1) +
labs(fill = "Mean Price €", title = 'Mean Housing Price Per Province (Spain)') +
theme_minimal() +
theme(plot.margin = margin(0, 0, 0, 0, "cm"),
axis.title.x = element_blank(),
axis.title.y = element_blank()) +
coord_sf(xlim = mainland_limits[1:2], ylim = mainland_limits[3:4])
# Create a separate label plot with region names
label_plot <- ggplot() +
geom_text(data = merged_data, aes(x = 1, y = Codigo, label = cod_text),
color = "black",
size = 2,
hjust = 0,
vjust = 0.5) +
labs(x = "", y = "") +
theme_minimal() +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
# Combine the map plot and label plot using the patchwork package
combined_plot <- map_plot + label_plot +
plot_layout(ncol = 2, widths = c(11.5, 2.9)) +
theme(legend.text = element_text(size = 105))
# Display the combined plot
combined_plot
Let us now plot a boxplot to check for the outliers per province.
price_dist_plot <- ggplot(groupped_by_province, aes(x = province, y = price)) +
geom_boxplot() +
labs(title = "Boxplot: Prices per Province", x = "Province", y = "Prices (scale = log10)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 35, hjust = 1)) +
scale_y_continuous(labels = comma_format()) + scale_y_log10()
## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.
price_dist_plot
Next, the median housing prices per province will be plotted.
merged_data$center <- sf::st_centroid(merged_data$geometry)
#mainland_limits <- c(min(min_x) /5 , max(max_x), min(min_y)*1.15, max(max_y))
mainland_limits <- c(min(min_x) , max(max_x), min(min_y), max(max_y))
map_plot <- ggplot() +
geom_sf(data = merged_data, aes(fill = median)) +
scale_fill_gradient2(low = "lightblue",
mid = 'white',
high = "red",
midpoint = median(merged_data$median, na.rm = TRUE),
label = comma_format()) +
geom_text(data = merged_data,
aes(x = st_coordinates(center)[, 1], y = st_coordinates(center)[, 2], label = Codigo),
color = "black",
size = 2,
nudge_x = 0.1,
nudge_y = 0.1) +
labs(fill = "Median Price €", title = 'Median Housing Prices Per Province (Spain)') +
theme_minimal() +
theme(plot.margin = margin(0, 0, 0, 0, "cm"),
axis.title.x = element_blank(),
axis.title.y = element_blank()) +
coord_sf(xlim = mainland_limits[1:2], ylim = mainland_limits[3:4])
# Create a separate label plot with region names
label_plot <- ggplot() +
geom_text(data = merged_data, aes(x = 1, y = Codigo, label = cod_text),
color = "black",
size = 2,
hjust = 0,
vjust = 0.5) +
labs(x = "", y = "") +
theme_minimal() +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
# Combine the map plot and label plot using the patchwork package
combined_plot <- map_plot + label_plot +
plot_layout(ncol = 2, widths = c(11.5, 2.9)) +
theme(legend.text = element_text(size = 105))
# Display the combined plot
#ggsave("./00_Pictures/02_Processed/06_MedianHousingPrices.png", plot = combined_plot)
combined_plot
What is really surprising is that Navarra and Cáceres have a very high median, higher than Madrid's. Maybe this is an artifact of the sample size?
number_units <- summarize(groupped_by_province, numbers = n())
n_per_province <-ggplot(number_units, aes(x = province, y = numbers, fill = province)) +
geom_bar(stat = 'identity') +
labs(x = 'Province', y = 'No. of Datapoints', title = 'Distribution of Data Per Province') +
theme_minimal() +
scale_y_continuous(limits = c(0, NA)) +
theme(legend.text = element_text(size =5),
legend.key.size = unit(0.2, 'cm'),
axis.text.x = element_blank(),
legend.position = 'none') +
guides(fill = guide_legend(override.aes = list(size = 0.01, ncol = 3)))
combined_plots <- grid.arrange(price_dist_plot, n_per_province, nrow = 2, heights = c(1.5, 1))
#ggsave(plot = combined_plots, "./00_Pictures/02_Processed/08_data_dis_prov.png")
combined_plots
## TableGrob (2 x 1) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]
Maybe the situation is better if we divide the observations by region?
number_units <- summarize(groupped_by_region, numbers = n())
max_no <- max(number_units$numbers)
ggplot(number_units, aes(x = region, y = numbers, fill = region)) +
geom_bar(stat = 'identity') +
labs(x = 'region', y = 'No. of Datapoints', title = 'Distribution of Data Per Autonomous Comunity') +
theme_minimal() +
scale_y_continuous(limits = c(0, NA)) +
theme(legend.text = element_text(size =5),
legend.key.size = unit(0.3, 'cm'),
axis.text.x = element_blank()) +
guides(fill = guide_legend(override.aes = list(size = 0.2)))
number_units
## # A tibble: 15 × 2
## region numbers
## <chr> <int>
## 1 Andalucía 3013
## 2 Aragón 613
## 3 Canarias 116
## 4 Cantabria 8
## 5 Castilla y León 165
## 6 Castilla-La Mancha 32
## 7 Cataluña 3276
## 8 Comunidad de Madrid 2185
## 9 Comunidad Valenciana 2524
## 10 Extremadura 9
## 11 Galicia 606
## 12 Melilla 1
## 13 Navarra 3
## 14 País Vasco 298
## 15 Principado de Asturias 1
This data is probably not only highly concentrated by province and region but also by city.
groupped_by_municipality <- group_by(dataset, municipality)
mutate(groupped_by_municipality,
numbers = n(),
weights = numbers / nrow(dataset)) %>%
select(municipality, numbers, weights) -> number_units
unique(arrange(number_units, desc(numbers)))
## # A tibble: 172 × 3
## # Groups: municipality [172]
## municipality numbers weights
## <chr> <int> <dbl>
## 1 Madrid 2185 0.170
## 2 Barcelona 2140 0.167
## 3 Córdoba 1191 0.0927
## 4 Elche 1179 0.0918
## 5 Valencia 1117 0.0869
## 6 Hospitalet de Llobregat 1087 0.0846
## 7 Málaga 851 0.0662
## 8 Sevilla 832 0.0647
## 9 Zaragoza 602 0.0468
## 10 Vigo 600 0.0467
## # ℹ 162 more rows
Computing weights for regions yields:
arrange(unique(select(mutate(groupped_by_region, weights = n() / nrow(dataset)), region, weights)), desc(weights))
## # A tibble: 15 × 2
## # Groups: region [15]
## region weights
## <chr> <dbl>
## 1 Cataluña 0.255
## 2 Andalucía 0.234
## 3 Comunidad Valenciana 0.196
## 4 Comunidad de Madrid 0.170
## 5 Aragón 0.0477
## 6 Galicia 0.0472
## 7 País Vasco 0.0232
## 8 Castilla y León 0.0128
## 9 Canarias 0.00903
## 10 Castilla-La Mancha 0.00249
## 11 Extremadura 0.000700
## 12 Cantabria 0.000623
## 13 Navarra 0.000233
## 14 Melilla 0.0000778
## 15 Principado de Asturias 0.0000778
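The same weights can be obtained more compactly with dplyr::count(), assuming dataset is ungrouped at this point (a sketch equivalent to the grouped mutate above):

```r
# Sketch: region weights via count() instead of a grouped mutate
dataset %>%
  count(region, name = 'n_obs') %>%
  mutate(weights = n_obs / sum(n_obs)) %>%
  arrange(desc(weights))
```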
Apparently the concentration is very high in those regions containing metropolitan areas. The next code chunk summarizes the number of unique regions, provinces and cities in the dataset.
unique_locations <- tibble(
'no. cities' = length(unique(dataset$municipality)),
'no. regions' = length(unique(dataset$region)),
'no. provinces' = length(unique(dataset$province))
)
datatable(unique_locations, options = list(pageLength = 10))
In this section a linear regression model is run in order to understand the key factors driving real-estate prices in Spain. The section starts by describing the variables that will be used as covariates, adding a short explanation of why controlling for each variable is meaningful.
<li><b>Population per city ($\mathsf{pop}$):</b> The population of the city where the property is located.
</li>
<ul>
<li> As in the case of density, the population itself can serve as a predictor of housing prices.
</li>
</ul>
</ul>
In this section the variables will be encoded as in Section 3.1.1. First of all, a subset named subset containing only those variables that will be used for model estimation is created. Notice that the column named X is actually an ID column, which will be retained in case the datasets need to be joined again.
colnames(dataset)[1] <- 'ID'
variables <- c('ID', 'price', 'room', 'bathroom', 'size_sqm', 'type', 'floor', 'condition',
'age', 'parking', 'lift', 'energy_class', 'density_skm', 'mean_salary',
'unemp', 'crime_province', 'pop')
# create columns to store the different types of housing in the new dataset (subset)
dataset[, variables] %>%
mutate(sapt = ifelse(dataset$type == 'Apartamento', 1, 0),
apt = ifelse((dataset$type == 'Piso' | type == 'Planta baja'), 1, 0),
duplex = ifelse(dataset$type == 'Dúplex', 1, 0),
penth = ifelse(dataset$type == 'Ático', 1, 0),
rhouse = ifelse(dataset$type == 'Casa adosada', 1, 0),
house = ifelse(dataset$type == 'Casa o chalet', 1, 0),
loft = ifelse((dataset$type == 'Loft' | type == 'Estudio'), 1, 0),
ranch = ifelse(dataset$type == 'Finca rústica', 1, 0)
) %>% # Now select all columns but the type one and assign it to a dataframe named subset
select(-type) %>% as_tibble() -> subset
# We save this dataset and give it the name 'RawRegressionSet'
#write.csv(subset,'./01_RawDataSets/07_RawRegressionSet.csv', fileEncoding = 'latin1')
RawRegressionSet <- read.csv('./01_RawDataSets/07_RawRegressionSet.csv', fileEncoding = 'latin1')
RawRegressionSet <- select(RawRegressionSet, -'X')
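As an alternative to the hand-written ifelse() dummies above, base R's model.matrix() can one-hot encode a factor in a single call. The sketch below uses a toy type column; the resulting column names differ from the hand-built ones:

```r
# Sketch: one-hot encoding with model.matrix() (no intercept, one column per level)
toy <- data.frame(type = c('Piso', 'Ático', 'Casa o chalet', 'Piso'))
model.matrix(~ type - 1, data = toy)
```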
Next, the missing values for the variable parking will be substituted with ‘no’ (recall the explanation in Section 3.1.1 about treating missing values in this special case, as well as in the lift column, as ‘no’ rather than unavailable). Afterwards, all rows containing missing values in any of the other columns will be dropped.
subset <- RawRegressionSet
subset[subset$parking == -9, 'parking'] = 'no'
# Now drop all the rows containing the value -9
subset %>%
filter(if_all(everything(), ~ . != -9)) -> subset
datatable(subset, options = list(pageLength = 10))
The dataset was now reduced to 3,157 observations. This is the dataset we will work with when doing linear regression. The next step is to encode the variables.
# First of all, create a list with the age-buckets and name them according to their encoding value
age_list <- sort(unique(subset$age))
names(age_list) <- c(7, 0, 2, 3, 4, 1, 5, 6, 0)
# Now we do the same for the floor variable
floor_list <- sort(unique(subset$floor))
names(floor_list) <- c(10, 11, 12, 13, 14, 15, 1, 2, 3, 4, 5, 6, 7, 8, 9, -1, 0, -2, -3, 16)
# We use the lists above and their names to match the values in a further step and pass their names (the econded value) to the matched rows
subset %>% mutate(
parking = ifelse(subset$parking == 'no', 0, 1), # encode the parking variable assigning 0 to no and 1 otherwise
lift = ifelse(subset$lift == 'no', 0, 1), # enconde the lift varible assigning 0 to no and 1 otherwise
condition = ifelse(subset$condition == 'A reformar', 0, # encode the condition variable assigning 0 to poor
ifelse(subset$condition == 'Bien', 1, # 1 to good
ifelse((subset$condition == 'Casi nuevo' | condition == 'Muy Bien'), 2, 0) # 2 to almost new or very good
)
),
age = as.numeric(names(age_list)[match(age, unlist(age_list))]), # match the values of the column age with the values stored in the list above, pass the names stored in the list to the matched values.
floor = as.numeric(names(floor_list)[match(floor, unlist(floor_list))]) # do the same as above for the floor variable and finally assign the values to a dataset called LR_dataset
) -> LR_dataset
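The nested ifelse() calls for condition can also be replaced by a named lookup vector, which keeps the encoding table in one place; unlike the nested version, unseen levels yield NA instead of 0 (a sketch on toy values):

```r
# Sketch: encode 'condition' via a named lookup vector
cond_map <- c('A reformar' = 0, 'Bien' = 1, 'Casi nuevo' = 2, 'Muy Bien' = 2)
toy_cond <- c('Bien', 'A reformar', 'Muy Bien')
unname(cond_map[toy_cond])  # 1 0 2
```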
# The unemployment rate is divided by 100
#LR_dataset$unemp <- LR_dataset$unemp/100
#Save the dataset just created as the CleanedRegressionSet
#write.csv(LR_dataset, './01_RawDatasets/08_CleanedRegressionSet.csv', fileEncoding = 'latin1')
LR_dataset <- read.csv('./01_RawDatasets/08_CleanedRegressionSet.csv', fileEncoding = 'latin1')
LR_dataset <- select(LR_dataset, !X)
In the following section, we will do some exploratory analysis based on
the subset LR_dataset.
First, some descriptive statistics are computed in order to understand the dataset a little better:
summary(LR_dataset)
## ID price room bathroom
## Min. : 4 Min. : 35000 Min. : 1.00 Min. : 1.00
## 1st Qu.: 3244 1st Qu.: 178000 1st Qu.: 2.00 1st Qu.: 1.00
## Median : 6372 Median : 270000 Median : 3.00 Median : 2.00
## Mean : 6372 Mean : 397330 Mean : 3.01 Mean : 1.71
## 3rd Qu.: 9542 3rd Qu.: 450000 3rd Qu.: 4.00 3rd Qu.: 2.00
## Max. :12847 Max. :5000000 Max. :11.00 Max. :21.00
## size_sqm floor condition age parking
## Min. : 21 Min. :-3.00 Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.: 72 1st Qu.: 1.00 1st Qu.:0.000 1st Qu.:4.00 1st Qu.:0.000
## Median : 92 Median : 2.00 Median :1.000 Median :5.00 Median :0.000
## Mean : 113 Mean : 2.96 Mean :0.846 Mean :4.35 Mean :0.266
## 3rd Qu.: 122 3rd Qu.: 4.00 3rd Qu.:2.000 3rd Qu.:5.00 3rd Qu.:1.000
## Max. :9618 Max. :16.00 Max. :2.000 Max. :7.00 Max. :1.000
## lift energy_class density_skm mean_salary
## Min. :0.000 Min. :0.00 Min. : 15.9 Min. : 9246
## 1st Qu.:1.000 1st Qu.:2.00 1st Qu.: 4816.7 1st Qu.:12490
## Median :1.000 Median :5.00 Median : 5688.7 Median :16750
## Mean :0.764 Mean :3.79 Mean : 9479.9 Mean :14797
## 3rd Qu.:1.000 3rd Qu.:5.00 3rd Qu.:16508.9 3rd Qu.:16750
## Max. :1.000 Max. :6.00 Max. :21406.8 Max. :17059
## unemp crime_province pop sapt
## Min. :0.0635 Min. :2687 Min. : 5910 Min. :0.0000
## 1st Qu.:0.1015 1st Qu.:5340 1st Qu.: 319515 1st Qu.:0.0000
## Median :0.1101 Median :5904 Median :1636193 Median :0.0000
## Mean :0.1158 Mean :5628 Mean :1426464 Mean :0.0187
## 3rd Qu.:0.1171 3rd Qu.:6412 3rd Qu.:1636193 3rd Qu.:0.0000
## Max. :0.2606 Max. :6412 Max. :3280782 Max. :1.0000
## apt duplex penth rhouse
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :1.000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.873 Mean :0.0253 Mean :0.0576 Mean :0.00348
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## house loft ranch
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.0165 Mean :0.0038 Mean :0.00158
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
For the variables related to the type of housing, the mean values are complementary, i.e. they sum up to 1. 87.3% of the dataset consists of apartments; the second most frequent type of housing in the data is the penthouse, which is itself a kind of apartment. Houses together represent \((rhouse + house) = 0.00348 + 0.0165 \approx 0.02 \approx 2\%\) of the dataset. So, the average housing object in the dataset (based on the mean values) is an apartment with 3 rooms, 2 bathrooms and a size of 113 sqm; further, it is located on the third floor, has a moderate to bad energy-efficiency classification (3.79), is around 50 years old (class 4.35), has a lift but probably no parking space, and is located in a large to very large city (median population 1,636,193).
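The complementarity claim is easy to verify: since each object belongs to exactly one housing type, the dummy means must sum to one. A minimal self-contained sketch on a toy data frame (illustrative column names, not the actual LR_dataset):

```r
# Toy data frame with mutually exclusive type dummies: each row has exactly
# one type flagged, so the column means (the type shares) sum to 1.
toy <- data.frame(
  apt   = c(1, 1, 1, 0, 0),
  penth = c(0, 0, 0, 1, 0),
  house = c(0, 0, 0, 0, 1)
)
shares <- colMeans(toy)
shares        # apt 0.6, penth 0.2, house 0.2
sum(shares)   # 1
```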
Due to the great difference between the mean and the median of the variables \(\mathsf{price}\) and \(\mathsf{density}\), an outlier problem is likely to occur.
to_plot <- gather(LR_dataset, variable, value)
to_exclude <- c('ID', 'age', 'parking', 'lift', 'mun_class', 'sapt', 'apt', 'duplex', 'penth', 'house', 'rhouse', 'loft', 'ranch', 'condition', 'floor', 'energy_class')
cont_to_plot <- filter(to_plot, !variable %in% to_exclude)
level_plot <- ggplot(cont_to_plot, aes(x = variable, y = value)) +
geom_boxplot() +
xlab("Variable") +
ylab("Value") +
theme_minimal() +
ggtitle("Boxplot: Continuous Covariates Full Dataset") +
facet_wrap(~variable, scales = "free")
level_plot
Next, the data is transformed to log values in order to check whether this benefits the analysis:
log_variables <- c('bathroom', 'crime_province', 'price', 'room', 'size_sqm', 'unemp', 'pop', 'density_skm')
LR_dataset %>% # apply log to the variables in the log_variables vector and save the result to a new dataset named LLR_dataset
mutate_at(vars(all_of(log_variables)), log) -> LLR_dataset
The data is plotted again using this new dataset.
to_plot <- gather(LLR_dataset, variable, value)
to_exclude <- c('ID', 'age', 'parking', 'lift', 'mun_class', 'sapt', 'apt', 'duplex', 'penth', 'house', 'rhouse', 'loft', 'ranch', 'condition', 'floor', 'room', 'bathroom')
to_plot <- filter(to_plot, !variable %in% to_exclude)
ggplot(to_plot, aes(x = variable, y = value)) +
geom_boxplot() +
xlab("Variable") +
ylab("Value") +
theme_minimal() +
ggtitle("Boxplot: Continuous Covariates Log Values") +
facet_wrap(~variable, scales = "free")
The approach did not really help. The problem with outliers still exists even after logging the values. Next, 10% of the dataset will be trimmed (everything below the 0.05 and above the 0.95 quantile), which is likely to mitigate the problem. The focus will lie mainly on the distribution of price:
boundaries <- quantile(LR_dataset$price, probs = c(0.05, 0.95))
cutLR_dataset <- LR_dataset[LR_dataset$price >= boundaries[[1]] & LR_dataset$price <= boundaries[[2]],]
Lboundaries <- quantile(LLR_dataset$price, probs = c(0.05, 0.95))
cutLLR_dataset <- LLR_dataset[LLR_dataset$price >= Lboundaries[[1]] & LLR_dataset$price <= Lboundaries[[2]],]
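As a sanity check on this trimming step: cutting everything below the 0.05 and above the 0.95 quantile should retain roughly 90% of the observations. A toy illustration (simulated skewed data, not the actual prices):

```r
# Trimming at the 5% and 95% quantiles of a skewed variable keeps ~90% of it.
set.seed(1)
x <- rlnorm(1000)                         # right-skewed, like prices
b <- quantile(x, probs = c(0.05, 0.95))
kept <- x[x >= b[[1]] & x <= b[[2]]]
length(kept) / length(x)                  # ~0.90
```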
to_plot <- gather(cutLR_dataset, variable, value)
to_exclude <- c('ID', 'age', 'parking', 'lift', 'mun_class', 'sapt', 'apt', 'duplex', 'penth', 'house', 'rhouse', 'loft', 'ranch', 'condition', 'floor', 'room', 'bathroom')
to_plot <- filter(to_plot, !variable %in% to_exclude)
ggplot(to_plot, aes(x = variable, y = value)) +
geom_boxplot() +
xlab("Variable") +
ylab("Value") +
theme_minimal() +
ggtitle("Boxplot: Continuous Covariates 90%-Dataset") +
facet_wrap(~variable, scales = "free")
Apparently, cutting 10% of the dataset did not bring a substantial improvement to the outlier issue in the level case. While the outliers in the bottom of the price distribution were removed, this was not the case for the upper tail.
to_plot <- gather(cutLLR_dataset, variable, value)
to_exclude <- c('ID', 'age', 'parking', 'lift', 'mun_class', 'sapt', 'apt', 'duplex', 'penth', 'house', 'rhouse', 'loft', 'ranch', 'condition', 'floor', 'room', 'bathroom')
to_plot_cutted <- filter(to_plot, !variable %in% to_exclude)
logplot <- ggplot(to_plot_cutted, aes(x = variable, y = value)) +
geom_boxplot() +
xlab("Variable") +
ylab("Value") +
theme_minimal() +
ggtitle("Boxplot: Continuous Variables 90%-Dataset Log") +
facet_wrap(~variable, scales = "free")
#ggsave('./00_Pictures/02_Processed/09_BoxPContCov.png')
#ggsave('./00_Pictures/02_Processed/09_BoxPContCov.png', plot = logplot)
logplot
As can be seen from the plot above, at least for the dependent variable (price) the outlier problem could be solved. Hence, the linear regression will be carried out using the log values of those variables.
Again, the average housing object of this data is studied (in levels):
# Min, Max, 1st Qu, Median, Mean, 3rd Qu
labels = c('Min', 'Max', '1st Qu', 'Median', 'Mean', '3rd Qu')
variables <- colnames(select(cutLR_dataset, !ID))
statistics <- data.frame(
'Variable' = variables
)
for (label in labels) {
statistics = mutate(statistics, !!label := rep(0, nrow(statistics)))
}
for (row in 1:nrow(statistics)) {
statistics[row, 'Min'] = round(min(cutLR_dataset[statistics[row, 'Variable']]), 2)
statistics[row, 'Max'] = round(max(cutLR_dataset[statistics[row, 'Variable']]), 2)
statistics[row, '1st Qu'] = round(quantile(unlist(cutLR_dataset[statistics[row, 'Variable']]), 0.25), 2)
statistics[row, 'Median'] = round(quantile(unlist(cutLR_dataset[statistics[row, 'Variable']]), 0.5), 2)
statistics[row, 'Mean'] = round(mean(unlist(cutLR_dataset[statistics[row, 'Variable']])), 3)
statistics[row, '3rd Qu'] = round(quantile(unlist(cutLR_dataset[statistics[row, 'Variable']]), 0.75), 2)
}
datatable(statistics, options = list(pageLength = 10))
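The row-wise loop above can also be written more compactly by applying the summary functions column-wise; a sketch on a toy data frame (illustrative columns, not taken from cutLR_dataset):

```r
# Compact alternative to the loop: sapply over the columns and bind the six
# summary statistics into one matrix (rows = variables, columns = statistics).
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
stats <- t(sapply(toy, function(col) c(
  Min      = min(col),
  Max      = max(col),
  `1st Qu` = quantile(col, 0.25)[[1]],
  Median   = median(col),
  Mean     = mean(col),
  `3rd Qu` = quantile(col, 0.75)[[1]]
)))
round(stats, 2)
```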
The descriptive statistics above will be exported to LaTeX format.
#write_latex_table <- kable(statistics, format = "latex", booktabs = TRUE)
#writeLines(write_latex_table, 'table.tex')
So, based on the average values, the average object in the dataset costs €343,483, is an apartment, has 3 rooms, 2 bathrooms, and a size of 107 square meters, is located on the third floor, is in somewhat good condition, is around 50 years old, has no parking space, has a lift, and a regular to bad energy classification. It is situated in a very large city with 1,424,755 inhabitants, in a province where the criminality rate per 100,000 pop. equals 5,658 and the unemployment rate 11.52%; the mean salary of the city is €14,826 and its density 9,767.6 inhabitants/km².
LLR_dataset <- cutLLR_dataset
new_names <- c('lprice', 'lroom', 'lbathroom', 'lsize_sqm', 'ldensity_skm', 'lmean_salary',
'lunemp', 'lcrime_province', 'lpop')
colnames(LLR_dataset)[c(2, 3, 4, 5, 12, 13, 14, 15, 16)] <- new_names
write.csv(LLR_dataset, '09_CleanedLogRegressionSet.csv', fileEncoding = 'latin1')
LLR_dataset <- read.csv('09_CleanedLogRegressionSet.csv', fileEncoding = 'latin1', row.names = NULL)
LLR_dataset <- select(LLR_dataset, -c('X'))
A closer look at the distribution of prices shows that the outlier problem has been removed:
options(scipen = 2, digits = 6)
dens <- density(LLR_dataset$lprice)
plot(dens, col = 'darkblue', ylab = 'Density', xlab = 'log(Price in €)', main = 'LogPrice-Distribution')
polygon(dens, col = 'lightblue', border = NA)
grid()
Although the distribution does not really look like a normal one, as stated, there are no outliers anymore. The plot of a correlation matrix of the continuous variables may give some information about how the covariates correlate with the price.
cor_notvar <- c('ranch', 'loft', 'rhouse', 'house', 'sapt', 'duplex', 'energy_class', 'lift', 'parking', 'age', 'condition', 'floor', 'apt', 'penth', 'ID')
cor_matrix <- cor(select(LR_dataset, -all_of(cor_notvar)))
corrplot(cor_matrix, method = 'color', type = 'full', order = 'hclust', tl.col = 'black')
The model reads:
\[\boldsymbol{\mathsf{lprice}}_{\left(2847\times1\right)}=\mathbf{X}_{\left(2847\times24\right)}\boldsymbol{\beta}_{\left(24\times1\right)}+\boldsymbol{\varepsilon}_{\left(2847\times1\right)},\qquad\boldsymbol{\varepsilon}\overset{i.i.d.}{\sim}{\mathbb N}\left(0,\sigma^{2}\right)\] where \(\boldsymbol{X}\) is a matrix with the covariates and a vector of \(1\)’s in its first column, \(\boldsymbol{\beta}\) is a column vector of coefficients, starting with \(\mathsf{\beta_{0}}\) as its first value and \(\mathsf{\beta_{23}}\) as its last, and \(\boldsymbol{\varepsilon}\) is a normally distributed vector with mean zero and homoscedastic variance (which we will test for).
| \(i\) | \(x_{i}\) | \(Label_{i}\) |
|---|---|---|
| 0 | 1 | constant |
| 1 | lroom | log no. of rooms |
| 2 | lbathroom | log no. of bathrooms |
| 3 | lsize_sqm | log built size of the property |
| 4 | floor | floor where the object is located |
| 5 | condition | condition of the object [poor(0), good(1), very good(2)] |
| 6 | age | age class of the object [0: <5 years to 7: >100 years] |
| 7 | parking | parking space [1 if true] |
| 8 | lift | has lift [1 if true] |
| 9 | energy_class | energy class [0: best to 6:worst] |
| 10 | ldensity_skm | log pop. density per \(km^{2}\) |
| 11 | lmean_salary | log city mean salary |
| 12 | lunemp | log province unemployment rate |
| 13 | lcrime_province | log province criminality rate per 100,000 pop. |
| 14 | lpop | log city population |
| 15 | sapt | 1 if small apartment |
| 16 | apt | 1 if apartment |
| 17 | duplex | 1 if duplex |
| 18 | penth | 1 if penthouse |
| 19 | rhouse | 1 if row-house |
| 20 | house | 1 if house |
| 21 | loft | 1 if loft |
| 22 | lroom * lsize_sqm | interaction between lsize and lroom |
| 23 | age * condition | interaction between age and cond. classes |
Notice that the variable \(\mathsf{ranch}\) is not included in the equation. This category is implied by the others and defines what is known as the base model: the configuration in which all factor/dummy variables are set to zero. In this case, the base model is a ranch that is very new (age = 0), in poor condition (condition = 0), has no lift and no parking space, but a very good energy-efficiency classification.
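Why one category must be left out can be shown with a toy example: with an intercept plus a dummy for every category, the dummy columns sum to the intercept column and the design matrix becomes rank-deficient (the dummy-variable trap). A minimal sketch, unrelated to the actual dataset:

```r
# With an intercept and ALL category dummies, lm() detects the perfect
# collinearity and returns NA for one redundant coefficient.
type <- factor(c('apt', 'apt', 'house', 'ranch', 'ranch'))
d <- as.data.frame(model.matrix(~ type - 1))   # one dummy per category
d$y <- c(1, 2, 3, 4, 5)
coef(lm(y ~ ., data = d))                      # one coefficient comes back NA
```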
Run the regression:
model.equation <- 'lprice ~ lroom + lbathroom + lsize_sqm + floor + parking + lift + energy_class + ldensity_skm + lmean_salary + lunemp + lcrime_province + sapt + duplex + penth + rhouse + apt + house + loft + lpop + lroom * lsize_sqm + age * condition'
model <- lm(model.equation, data = LLR_dataset)
table_results <- stargazer(model,
type = 'text', # either text or latex output
align = TRUE,
header = FALSE,
summary = TRUE,
title = 'Regression-Results',
style = 'aer' # adjust the style here, e.g., commadefault
# which summary statistics to report could also be specified here
)
##
## Regression-Results
## ==========================================================
## lprice
## ----------------------------------------------------------
## lroom -0.834***
## (0.143)
##
## lbathroom 0.421***
## (0.022)
##
## lsize_sqm 0.507***
## (0.040)
##
## floor 0.011***
## (0.003)
##
## parking 0.076***
## (0.017)
##
## lift 0.210***
## (0.017)
##
## energy_class -0.013***
## (0.004)
##
## ldensity_skm 0.057***
## (0.011)
##
## lmean_salary 0.000***
## (0.000)
##
## lunemp 0.006
## (0.049)
##
## lcrime_province 0.425***
## (0.078)
##
## sapt 0.136
## (0.159)
##
## duplex -0.016
## (0.157)
##
## penth 0.144
## (0.154)
##
## rhouse 0.014
## (0.187)
##
## apt 0.003
## (0.152)
##
## house -0.035
## (0.158)
##
## loft -0.200
## (0.188)
##
## lpop 0.081***
## (0.017)
##
## age 0.034***
## (0.007)
##
## condition 0.060***
## (0.021)
##
## lroom:lsize_sqm 0.148***
## (0.032)
##
## age:condition -0.004
## (0.004)
##
## Constant 3.892***
## (0.622)
##
## Observations 2,847
## R2 0.671
## Adjusted R2 0.668
## Residual Std. Error 0.332 (df = 2823)
## F Statistic 249.800*** (df = 23; 2823)
## ----------------------------------------------------------
## Notes: ***Significant at the 1 percent level.
## **Significant at the 5 percent level.
## *Significant at the 10 percent level.
#latex_table <- print(table_results, tabular.environment = 'longtable')
#writeLines(latex_table, 'table.tex')
summary_table <- xtable(summary(model))
attr(summary_table, 'caption') <- 'Linear Regression - Results'
colnames(summary_table) <- c('Coeff.', 'Std. Error', 't-value', 'Pr(>|t|)')
#print(summary_table, tabular.environment = 'longtable')
datatable(summary_table)
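Because of the \(\mathsf{lroom \times lsize\_sqm}\) interaction, the size elasticity of the price is not the bare coefficient but depends on the number of rooms. A sketch of reading it off the estimates above (coefficient values copied from the regression table; treat them as rounded):

```r
# Size elasticity implied by the log-log model with interaction:
#   d(lprice)/d(lsize_sqm) = beta_lsize + beta_interaction * lroom
beta_lsize <- 0.507   # coefficient on lsize_sqm (from the table above)
beta_inter <- 0.148   # coefficient on lroom:lsize_sqm
elasticity <- function(rooms) beta_lsize + beta_inter * log(rooms)
elasticity(3)         # for a typical 3-room object: ~0.67
```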
Next, the model is used for a point estimate: the mean of the fitted values, which for OLS with an intercept equals the prediction at the mean values of the covariates.
prediction <- predict(model)
mean(prediction)
## [1] 12.5728
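Since the model is fitted in logs, the value 12.5728 is a log price; exponentiating gives roughly €288,600. Note, however, that a naive \(\exp(\cdot)\) understates the conditional mean when the errors are log-normal; a common correction is Duan's smearing factor (sketched in the comments below, assuming the fitted `model` object is available):

```r
# Back-transform a log-scale prediction to euros.
log_pred <- 12.5728
naive_eur <- exp(log_pred)   # ~ 288,600 EUR (biased low as a mean estimate)
naive_eur
# Duan (1983) smearing correction, using the model residuals:
# smear <- mean(exp(residuals(model)))
# corrected_eur <- naive_eur * smear
```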
In this section, the Gauss-Markov assumptions will be tested.
First, define a helper function to interpret the test results:
print_interpretation <- function(x){
if (x$p.value < 0.05) {
print(paste('reject null hypothesis. pvalue:', x$p.value, sep = " "))
} else {
print(paste('failed to reject null hypothesis. pvalue:', x$p.value, sep = " "))
}
}
\[H_{0}:\text{error term is homoscedastic}\]
test_homos <- bptest(model, ~ fitted(model) + I(fitted(model)^2))
print_interpretation(test_homos)
## [1] "reject null hypothesis. pvalue: 5.05351855997932e-83"
plot(model$residuals,
xlab = 'unit i',
ylab = 'residuals',
main = 'Residual Plot')
\[H_{0}:\text{error term is normally distributed}\]
test_jb <- jarque.bera.test(model$residuals)
print_interpretation(test_jb)
## [1] "reject null hypothesis. pvalue: 0"
hist(model$residuals, breaks = 100,
main = 'Residual Distribution',
xlab = 'observed lprice - fitted lprice')
Ramsey RESET test for linearity of the data (functional form).
\[H_{0}:\text{The functional form is correct}\]
test_reset <- reset(model, 2:3)
print_interpretation(test_reset)
## [1] "reject null hypothesis. pvalue: 1.56956962172245e-60"
\[H_{0}:\text{No correlation in the residuals}\]
test_dw <- dwtest(formula = model, alternative = 'two.sided')
print_interpretation(test_dw)
## [1] "failed to reject null hypothesis. pvalue: 0.599910543036602"
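Since the Breusch-Pagan test rejected homoscedasticity, the conventional OLS standard errors above are unreliable; heteroscedasticity-robust (White/HC) standard errors are the usual remedy (in practice via `sandwich::vcovHC()` together with `lmtest::coeftest()`). A self-contained base-R sketch of the HC1 computation on toy data, not the housing model:

```r
# Manual White/HC1 robust standard errors on simulated heteroscedastic data.
set.seed(42)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + x)       # error variance grows with x
m <- lm(y ~ x)
X <- model.matrix(m)
u <- residuals(m)
bread <- solve(crossprod(X))                   # (X'X)^{-1}
meat  <- crossprod(X * u)                      # X' diag(u^2) X
vc    <- bread %*% meat %*% bread              # HC0 sandwich estimator
hc1_se <- sqrt(diag(vc) * n / (n - ncol(X)))   # HC1 small-sample scaling
hc1_se                                         # robust SEs for intercept and x
```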
For those who are not familiar with the structure of a Python dictionary: it resembles a dataframe in R with row names, where the key can be interpreted as a row name and the value as the entry of the table. The general format reads {key: value}.↩︎
Notice that the city name was retrieved from the URL. By looking at the URL displayed in the output of Listing 1.2, the reader will find the city name in ’es/es/comprar/vivienda/sevilla-capital/…’.↩︎